1
Deploying Apache Flume for
Low-Latency Analytics
Mike Percy
Software Engineer, Cloudera
PMC Member & Committer, Apache Flume
About me
2
• Software engineer @ Cloudera
• Previous life @ Yahoo! in CORE
• PMC member / committer on Apache Flume
• Part-time MSCS student @ Stanford
• I play the drums (when I can find the time!)
• Twitter: @mike_percy
What do you mean by low-latency?
3
• Shooting for a 2-minute SLA
• from time of occurrence
• to visibility via interactive SQL
• Pretty modest SLA
• The goal is fast access to all of our “big data”
• from a single SQL interface
Approach
4
• Stream Avro data into HDFS using Flume
• Lots of small files
• Convert the Avro data to Parquet in batches
• Compact into big files
• Efficiently query both data sets using Impala
• UNION ALL or (soon) VIEW
• Eventually delete the small files
Agenda
5
1. Using Flume
• Streaming Avro into HDFS using Flume
2. Using Hive & Impala
• Batch-converting older Avro data to Parquet
• How to query the whole data set (new and old together)
3. Gotchas
4. A little about HBase
github.com/mpercy/flume-rtq-hadoop-summit-2013
Why Flume event streaming is awesome
• Couldn’t I just do this with a shell script?
• What year is this, 2001? There is a better way!
• Scalable collection, aggregation of events (i.e. logs)
• Dynamic, contextual event routing
• Low latency, high throughput
• Declarative configuration
• Productive out of the box, yet powerfully extensible
• Open source software
6
Flume Agent design
7
Multi-agent network architecture
8
Typical fan-in topology
Shown here:
• Many agents (1 per host) at edge tier
• 4 “collector” agents in that data center
• 3 centralized agents writing to HDFS
Events
Flume’s core data movement atom: the Event
• An Event has a simple schema
• Event header: Map<String, String>
• similar in spirit to HTTP headers
• Event body: array of bytes
9
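The Event shape above can be sketched as a tiny Java class. This is an illustration only — Flume's real org.apache.flume.Event is an interface with similar accessors, and SimpleEvent here is a hypothetical stand-in:

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal sketch of Flume's Event shape: string headers plus an opaque byte[] body. */
public class SimpleEvent {
    private final Map<String, String> headers = new HashMap<>();
    private byte[] body = new byte[0];

    public Map<String, String> getHeaders() { return headers; }
    public byte[] getBody() { return body; }
    public void setBody(byte[] body) { this.body = body; }

    public static void main(String[] args) {
        SimpleEvent e = new SimpleEvent();
        // headers carry routing/context metadata, e.g. a timestamp used by %Y/%m/%d path escapes
        e.getHeaders().put("timestamp", "1372244099633");
        e.setBody("1372244973556,1372244973,click".getBytes());
        System.out.println(e.getHeaders().get("timestamp") + " body=" + e.getBody().length + " bytes");
    }
}
```

Sources set the body (and often a few headers); sinks and serializers read them back.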
Channels
• Passive Component
• Channel type determines the reliability guarantees
• Stock channel types:
• JDBC – has performance issues
• Memory – lower latency for small writes, but not durable
• File – provides durability; most people use this
10
Sources
• Event-driven or polling-based
• Most sources can accept batches of events
• Stock source implementations:
• Avro-RPC – other Java-based Flume agents can send data to this source port
• Thrift-RPC – for interop with Thrift clients written in any language
• HTTP – post via a REST service (extensible)
• JMS – ingest from Java Message Service
• Syslog-UDP, Syslog-TCP
• Netcat
• Scribe
• Spooling Directory – parse and ingest completed log files
• Exec – execute shell commands and ingest the output
11
Sinks
• All sinks are polling-based
• Most sinks can process batches of events at a time
• Stock sink implementations (as of 1.4.0 RC1):
• HDFS
• HBase (2 variants)
• SolrCloud
• ElasticSearch
• Avro-RPC, Thrift-RPC – Flume agent inter-connect
• File Roller – write to local filesystem
• Null, Logger, Seq, Stress – for testing purposes
12
Flume component interactions
• Source: puts events into the local channel
• Channel: stores events until someone takes them
• Sink: takes events from the local channel
• On failure, sinks back off and retry forever until success
13
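The backoff-and-retry behavior of a sink can be sketched as a simple loop. This is an assumed illustration of the pattern, not Flume's actual Sink API (which retries indefinitely; the maxAttempts cap here exists only so the demo terminates):

```java
import java.util.function.BooleanSupplier;

/** Sketch of a sink's backoff-and-retry loop. */
public class BackoffDemo {
    /** Doubles the wait after each failed attempt, capped at maxBackoffMs.
        Returns the attempt number that succeeded, or -1 if maxAttempts is exhausted. */
    public static int retryUntilSuccess(BooleanSupplier attempt,
                                        long initialBackoffMs, long maxBackoffMs,
                                        int maxAttempts) {
        long backoff = initialBackoffMs;
        for (int i = 1; i <= maxAttempts; i++) {
            if (attempt.getAsBoolean()) return i;   // transaction committed
            // a real sink would sleep(backoff) here; we only grow the delay value
            backoff = Math.min(backoff * 2, maxBackoffMs);
        }
        return -1; // demo-only: a real Flume sink keeps retrying forever
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // simulate a downstream that fails twice, then succeeds
        int attempts = retryUntilSuccess(() -> ++calls[0] >= 3, 100, 5000, 10);
        System.out.println("succeeded on attempt " + attempts);
    }
}
```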
Flume configuration example
agent1.properties:
# Flume will start these named components
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1
# channel config
agent1.channels.ch1.type = memory
# source config
agent1.sources.src1.type = netcat
agent1.sources.src1.channels = ch1
agent1.sources.src1.bind = 127.0.0.1
agent1.sources.src1.port = 8080
# sink config
agent1.sinks.sink1.type = logger
agent1.sinks.sink1.channel = ch1
14
Our Flume config
agent.channels = mem1
agent.sources = netcat
agent.sinks = hdfs1 hdfs2 hdfs3 hdfs4
agent.channels.mem1.type = memory
agent.channels.mem1.capacity = 1000000
agent.channels.mem1.transactionCapacity = 10000
agent.sources.netcat.type = netcat
agent.sources.netcat.channels = mem1
agent.sources.netcat.bind = 0.0.0.0
agent.sources.netcat.port = 8080
agent.sources.netcat.ack-every-event = false
agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.channel = mem1
agent.sinks.hdfs1.hdfs.useLocalTimeStamp = true
agent.sinks.hdfs1.hdfs.path = hdfs:///flume/webdata/avro/year=%Y/month=%m/day=%d
agent.sinks.hdfs1.hdfs.fileType = DataStream
agent.sinks.hdfs1.hdfs.inUsePrefix = .
agent.sinks.hdfs1.hdfs.filePrefix = log
agent.sinks.hdfs1.hdfs.fileSuffix = .s1.avro
agent.sinks.hdfs1.hdfs.batchSize = 10000
agent.sinks.hdfs1.hdfs.rollInterval = 30
agent.sinks.hdfs1.hdfs.rollCount = 0
agent.sinks.hdfs1.hdfs.rollSize = 0
agent.sinks.hdfs1.hdfs.idleTimeout = 60
agent.sinks.hdfs1.hdfs.callTimeout = 25000
agent.sinks.hdfs1.serializer = com.cloudera.flume.demo.CSVAvroSerializer$Builder
agent.sinks.hdfs2.type = hdfs
agent.sinks.hdfs2.channel = mem1
# etc … sink configs 2 thru 4 omitted for brevity
15
Avro schema
16
{
  "type": "record",
  "name": "UserAction",
  "namespace": "com.cloudera.flume.demo",
  "fields": [
    {"name": "txn_id", "type": "long"},
    {"name": "ts", "type": "long"},
    {"name": "action", "type": "string"},
    {"name": "product_id", "type": "long"},
    {"name": "user_ip", "type": "string"},
    {"name": "user_name", "type": "string"},
    {"name": "path", "type": "string"},
    {"name": "referer", "type": "string", "default": ""}
  ]
}
Custom Event Parser / Avro Serializer
/** Converts comma-separated-value (CSV) input text into binary Avro output on HDFS. */
public class CSVAvroSerializer extends AbstractAvroEventSerializer<UserAction> {

  private final OutputStream out;

  private CSVAvroSerializer(OutputStream out) {
    this.out = out;
  }

  @Override
  protected OutputStream getOutputStream() {
    return out;
  }

  @Override
  protected Schema getSchema() {
    return UserAction.SCHEMA$;
  }

  @Override
  protected UserAction convert(Event event) {
    // Each event body is one CSV line; field order matches the Avro schema.
    String line = new String(event.getBody());
    String[] fields = line.split(",");
    UserAction action = new UserAction();
    action.setTxnId(Long.parseLong(fields[0]));
    action.setTimestamp(Long.parseLong(fields[1]));
    action.setAction(fields[2]);
    action.setProductId(Long.parseLong(fields[3]));
    action.setUserIp(fields[4]);
    action.setUserName(fields[5]);
    action.setPath(fields[6]);
    action.setReferer(fields[7]);
    return action;
  }

  public static class Builder {
    // Builder impl omitted for brevity
  }
}
17
And now our data flows into Hadoop
18
Send some data to Flume:
./generator.pl | nc localhost 8080
Flume does the rest:
$ hadoop fs -ls /flume/webdata/avro/year=2013/month=06/day=26
Found 16 items
/flume/webdata/avro/year=2013/month=06/day=26/log.1372244099633.s1.avro
/flume/webdata/avro/year=2013/month=06/day=26/log.1372244099633.s2.avro
/flume/webdata/avro/year=2013/month=06/day=26/log.1372244099633.s3.avro
/flume/webdata/avro/year=2013/month=06/day=26/log.1372244099633.s4.avro
/flume/webdata/avro/year=2013/month=06/day=26/log.1372244099634.s1.avro
/flume/webdata/avro/year=2013/month=06/day=26/log.1372244099634.s2.avro
/flume/webdata/avro/year=2013/month=06/day=26/log.1372244099634.s3.avro
/flume/webdata/avro/year=2013/month=06/day=26/log.1372244099634.s4.avro
. . .
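The generator feeds netcat one CSV line per record, in the field order that CSVAvroSerializer.convert() expects. A minimal sketch of building such a line (field values here are invented for illustration):

```java
/** Builds one CSV line per UserAction record, in the order convert() parses:
    txn_id, ts, action, product_id, user_ip, user_name, path, referer. */
public class CsvLineDemo {
    public static String toCsv(long txnId, long ts, String action, long productId,
                               String userIp, String userName, String path, String referer) {
        return txnId + "," + ts + "," + action + "," + productId + ","
             + userIp + "," + userName + "," + path + "," + referer;
    }

    public static void main(String[] args) {
        String line = toCsv(1372244973556L, 1372244973L, "click", 30111,
                            "10.0.0.1", "ylucas", "/product/30111", "http://example.com/");
        // splitting on "," recovers the 8 fields convert() expects
        System.out.println(line.split(",").length + " fields: " + line);
    }
}
```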
Querying our data via SQL: Impala and Hive
19
Shares Everything Client-Facing
• Metadata (table definitions)
• ODBC/JDBC drivers
• SQL syntax (Hive SQL)
• Flexible file formats
• Machine pool
• Hue GUI
But Built for Different Purposes
• Hive: runs on MapReduce and ideal for batch processing
• Impala: native MPP query engine ideal for interactive SQL
[Diagram: Hive (SQL syntax, MapReduce compute framework, batch processing) and Impala (SQL syntax plus its own compute framework, interactive SQL) share the same metadata, resource management, and storage integration layers over HDFS and HBase, with flexible file formats: text, RCFile, Parquet, Avro, etc.]
Cloudera Impala
20
Interactive SQL for Hadoop
• Responses in seconds
• Nearly ANSI-92 standard SQL, shared with Hive (Hive SQL)
Native MPP Query Engine
• Purpose-built for low-latency queries
• Separate runtime from MapReduce
• Designed as part of the Hadoop ecosystem
Open Source
• Apache-licensed
Impala Query Execution
21
[Diagram: three impalad nodes, each running a query planner, query coordinator, and query executor, colocated with an HDFS DataNode and HBase; all share the Hive Metastore, HDFS NameNode, and statestore; a SQL app connects over ODBC.]
1) Request arrives via ODBC/JDBC/Beeswax/Shell
Impala Query Execution
22
(Same diagram as on the previous slide.)
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
Impala Query Execution
23
(Same diagram as on the previous slide.)
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
Parquet File Format: http://parquet.io
24
Open source, columnar file format for Hadoop, developed by Cloudera & Twitter
• Row-group format: the file contains multiple horizontal slices
• Supports storing each column in a separate file
• Supports fully shredded nested data
• Column values stored in native types
• Supports index pages for fast lookup
• Extensible value encodings
Table setup (using the Hive shell)
25
CREATE EXTERNAL TABLE webdataIn
PARTITIONED BY (year int, month int, day int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/flume/webdata/avro/'
TBLPROPERTIES
('avro.schema.url' = 'hdfs:///user/mpercy/UserAction.avsc');
/* this must be done for every new day: */
ALTER TABLE webdataIn
ADD PARTITION (year = 2013, month = 06, day = 20);
Conversion to Parquet
Create the archive table from Impala:
impala> CREATE TABLE webdataArchive
(txn_id bigint, ts bigint, action string, product_id bigint, user_ip string,
user_name string, path string, referer string)
PARTITIONED BY (year int, month int, day int)
STORED AS PARQUETFILE;
26
Conversion to Parquet
Load the Parquet table from an Impala SELECT:
ALTER TABLE webdataArchive
ADD PARTITION (year = 2013, month = 6, day = 20);
INSERT OVERWRITE TABLE webdataArchive
PARTITION (year = 2013, month = 6, day = 20)
SELECT txn_id, ts, action, product_id, user_ip, user_name, path, referer
FROM webdataIn
WHERE year = 2013 AND month = 6 AND day = 20;
27
Partition update & metadata refresh
Impala can’t see the Avro data until you add the
partition and refresh the metadata for the table
• Pre-create partitions on a daily basis
hive> ALTER TABLE webdataIn
ADD PARTITION (year = 2013, month = 6, day = 20);
• Refresh metadata for incoming table every minute
impala> REFRESH webdataIn
28
Querying against both data sets
Query against the correct table based on job schedule
SELECT txn_id, product_id, user_name, action
FROM webdataIn
WHERE year = 2013 AND month = 6
AND (day = 25 OR day = 26)
AND product_id = 30111
UNION ALL
SELECT txn_id, product_id, user_name, action
FROM webdataArchive
WHERE NOT (year = 2013 AND month = 6
AND (day = 25 OR day = 26))
AND product_id = 30111;
29
Query results
+---------------+------------+------------+--------+
| txn_id | product_id | user_name | action |
+---------------+------------+------------+--------+
| 1372244973556 | 30111 | ylucas | click |
| 1372244169096 | 30111 | fdiaz | click |
| 1372244563287 | 30111 | vbernard | view |
| 1372244268481 | 30111 | ntrevino | view |
| 1372244381025 | 30111 | oramsey | click |
| 1372244242387 | 30111 | wwallace | click |
| 1372244227407 | 30111 | thughes | click |
| 1372245609209 | 30111 | gyoder | view |
| 1372245717845 | 30111 | yblevins | view |
| 1372245733985 | 30111 | moliver | click |
| 1372245756852 | 30111 | pjensen | view |
| 1372245800509 | 30111 | ucrane | view |
| 1372244848845 | 30111 | tryan | click |
| 1372245073616 | 30111 | pgilbert | view |
| 1372245141668 | 30111 | walexander | click |
| 1372245362977 | 30111 | agillespie | click |
| 1372244892540 | 30111 | ypham | view |
| 1372245466121 | 30111 | yowens | buy |
| 1372245515189 | 30111 | emurray | view |
+---------------+------------+------------+--------+
30
Views
• It would be much better to be able to define a view
to simplify your queries instead of using UNION ALL
• Hive currently supports views
• Impala 1.1 will add VIEW support
• Available next month (July 2013)
31
Other considerations
• Don’t forget to delete the small files once they have
been compacted
• Keep your incoming data table to a minimum
• REFRESH <table> is expensive
• Flume may write duplicates
• Your webdataIn table may return duplicate entries
• Consider doing a de-duplication job via M/R before the
INSERT OVERWRITE
• Your schema should have keys to de-dup on
• DISTINCT might help with both of these issues
32
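The de-duplication step above can be sketched in a few lines: keep the first record seen for each key. In practice this would be the reduce side of an M/R job keyed on something like txn_id; this standalone sketch uses CSV strings whose first field is the key:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of keyed de-duplication before the INSERT OVERWRITE. */
public class DedupDemo {
    /** Keeps the first record seen for each txn_id; records are "txnId,rest" CSV lines. */
    public static List<String> dedupe(List<String> records) {
        Map<String, String> byKey = new LinkedHashMap<>(); // preserves arrival order
        for (String r : records) {
            String txnId = r.substring(0, r.indexOf(','));
            byKey.putIfAbsent(txnId, r);
        }
        return new ArrayList<>(byKey.values());
    }

    public static void main(String[] args) {
        List<String> out = dedupe(Arrays.asList(
            "1001,click", "1002,view", "1001,click", "1003,buy"));
        System.out.println(out); // duplicate of txn 1001 collapses to one row
    }
}
```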
Even lower latency via Flume & HBase
• HBase
• Flume has 2 HBase sinks:
• ASYNC_HBASE is faster, doesn’t support Kerberos or 0.96 yet
• HBASE is slower but uses the stock libs and supports everything
• Take care of de-dup by using idempotent operations
• Write to unique keys
• Stream + Batch pattern is useful here if you need counters
• Approximate counts w/ real-time increment ops
• Go back and reconcile the streaming numbers with batch-generated accurate numbers, excluding any duplicates created by retries
33
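Why unique row keys make retried writes idempotent: re-putting the same cell leaves the table unchanged, so duplicate deliveries from Flume are harmless. A toy sketch, where a HashMap stands in for an HBase table (this is not the HBase client API):

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of idempotent writes via unique row keys; the map stands in for an HBase table. */
public class IdempotentWriteDemo {
    private final Map<String, String> table = new HashMap<>();

    /** Row key built from fields that uniquely identify the event. */
    public void put(String userName, long txnId, String value) {
        table.put(userName + ":" + txnId, value);
    }

    public int rowCount() { return table.size(); }

    public static void main(String[] args) {
        IdempotentWriteDemo t = new IdempotentWriteDemo();
        t.put("ylucas", 1372244973556L, "click");
        t.put("ylucas", 1372244973556L, "click"); // retry of the same event: no new row
        System.out.println(t.rowCount());
    }
}
```

Counters are the opposite case — an increment applied twice is wrong — which is why the slide suggests real-time approximate counts reconciled later by batch.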
Flume ingestion strategies
• In this example, we used “netcat” because it’s easy
• For a system that needs reliable delivery we should
use something else:
• Flume RPC client API
• Thrift RPC
• HTTP source (REST)
• Spooling Directory source
• Memory channel vs. File channel
34
For reference
Flume docs
• User Guide: flume.apache.org/FlumeUserGuide.html
• Dev Guide:
flume.apache.org/FlumeDeveloperGuide.html
Impala tutorial:
• www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_tutorial.html
35
Flume 1.4.0
• Release candidate is out for version 1.4.0
• Adds:
• SSL connection security for Avro source
• Thrift source
• SolrCloud sink
• Flume embedding support (embedded agent)
• File channel performance improvements
• Easier plugin integration
• Many other improvements & bug fixes
• Try it: http://people.apache.org/~mpercy/flume/apache-flume-1.4.0-RC1/
36
Flume: How to get involved!
• Join the mailing lists:
• user-subscribe@flume.apache.org
• dev-subscribe@flume.apache.org
• Look at the code
• github.com/apache/flume – Mirror of the
flume.apache.org git repo
• File a bug (or fix one!)
• issues.apache.org/jira/browse/FLUME
• More on how to contribute:
• cwiki.apache.org/confluence/display/FLUME/How+to+Contribute
37
38
Thank You!
Mike Percy
Work: Software Engineer, Cloudera
Apache: PMC Member & Committer, Apache Flume
Twitter: @mike_percy
Code: github.com/mpercy/flume-rtq-hadoop-summit-2013
ApacheCon2022_Deep Dive into Building Streaming Applications with Apache Pulsar
 
NoSQL afternoon in Japan Kumofs & MessagePack
NoSQL afternoon in Japan Kumofs & MessagePackNoSQL afternoon in Japan Kumofs & MessagePack
NoSQL afternoon in Japan Kumofs & MessagePack
 
NoSQL afternoon in Japan kumofs & MessagePack
NoSQL afternoon in Japan kumofs & MessagePackNoSQL afternoon in Japan kumofs & MessagePack
NoSQL afternoon in Japan kumofs & MessagePack
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
 
OSS EU: Deep Dive into Building Streaming Applications with Apache Pulsar
OSS EU:  Deep Dive into Building Streaming Applications with Apache PulsarOSS EU:  Deep Dive into Building Streaming Applications with Apache Pulsar
OSS EU: Deep Dive into Building Streaming Applications with Apache Pulsar
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Building Distributed Data Streaming System
Building Distributed Data Streaming SystemBuilding Distributed Data Streaming System
Building Distributed Data Streaming System
 
Top ten-list
Top ten-listTop ten-list
Top ten-list
 

Plus de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Plus de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Dernier

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 

Dernier (20)

DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 

Deploying Apache Flume to enable low-latency analytics

• JDBC – has performance issues
• Memory – lower latency for small writes, but not durable
• File – provides durability; most people use this
10
Sources
• Event-driven or polling-based
• Most sources can accept batches of events
• Stock source implementations:
• Avro-RPC – other Java-based Flume agents can send data to this source port
• Thrift-RPC – for interop with Thrift clients written in any language
• HTTP – post via a REST service (extensible)
• JMS – ingest from Java Message Service
• Syslog-UDP, Syslog-TCP
• Netcat
• Scribe
• Spooling Directory – parse and ingest completed log files
• Exec – execute shell commands and ingest the output
11
Sinks
• All sinks are polling-based
• Most sinks can process batches of events at a time
• Stock sink implementations (as of 1.4.0 RC1):
• HDFS
• HBase (2 variants)
• SolrCloud
• ElasticSearch
• Avro-RPC, Thrift-RPC – Flume agent inter-connect
• File Roller – write to local filesystem
• Null, Logger, Seq, Stress – for testing purposes
12
Flume component interactions
• Source: puts events into the local channel
• Channel: stores events until someone takes them
• Sink: takes events from the local channel
• On failure, sinks back off and retry forever until success
13
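The take-or-roll-back contract above is what makes the pipeline lossless: an event leaves the channel only once a sink delivery succeeds. The following is a minimal Python sketch of that loop, not Flume's actual Java implementation; `Channel`, `FlakySink`, and `drain` are illustrative names invented here.

```python
import collections

class Channel:
    """Toy stand-in for a Flume channel: a passive event buffer."""
    def __init__(self):
        self._queue = collections.deque()
    def put(self, event):
        self._queue.append(event)
    def take(self):
        return self._queue.popleft() if self._queue else None
    def rollback(self, event):
        # Return an event whose delivery failed; it stays queued, so no data loss.
        self._queue.appendleft(event)

class FlakySink:
    """Toy sink that fails the first delivery attempt of every event."""
    def __init__(self):
        self.delivered = []
        self._fail_next = True
    def process(self, event):
        if self._fail_next:
            self._fail_next = False
            raise IOError("downstream unavailable")
        self._fail_next = True
        self.delivered.append(event)

def drain(channel, sink, max_attempts=100):
    """Poll the channel; on sink failure, roll the event back and retry."""
    for _ in range(max_attempts):
        event = channel.take()
        if event is None:
            return
        try:
            sink.process(event)
        except IOError:
            channel.rollback(event)

ch = Channel()
for i in range(5):
    ch.put({"body": "event-%d" % i})
sink = FlakySink()
drain(ch, sink)
```

Despite every event failing once, all five arrive exactly once and in order, which is the behavior the retry-forever design buys you.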
Flume configuration example
agent1.properties:

# Flume will start these named components
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# channel config
agent1.channels.ch1.type = memory

# source config
agent1.sources.src1.type = netcat
agent1.sources.src1.channels = ch1
agent1.sources.src1.bind = 127.0.0.1
agent1.sources.src1.port = 8080

# sink config
agent1.sinks.sink1.type = logger
agent1.sinks.sink1.channel = ch1
14
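The config is plain Java-properties syntax: the first three keys declare which named components the agent starts, and the remaining keys wire them together. As a sketch of that structure (this is a throwaway parser written for illustration, not any Flume API), the file can be read with a few lines of Python:

```python
def parse_flume_config(text):
    """Parse Java-properties-style Flume config lines into a flat dict."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

CONFIG = """
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1
agent1.channels.ch1.type = memory
agent1.sources.src1.type = netcat
agent1.sources.src1.channels = ch1
agent1.sources.src1.bind = 127.0.0.1
agent1.sources.src1.port = 8080
agent1.sinks.sink1.type = logger
agent1.sinks.sink1.channel = ch1
"""

props = parse_flume_config(CONFIG)
```

Note how the wiring is explicit: the source names the channel it writes to, and the sink names the channel it reads from; the two must agree for events to flow.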
Our Flume config

agent.channels = mem1
agent.sources = netcat
agent.sinks = hdfs1 hdfs2 hdfs3 hdfs4

agent.channels.mem1.type = memory
agent.channels.mem1.capacity = 1000000
agent.channels.mem1.transactionCapacity = 10000

agent.sources.netcat.type = netcat
agent.sources.netcat.channels = mem1
agent.sources.netcat.bind = 0.0.0.0
agent.sources.netcat.port = 8080
agent.sources.netcat.ack-every-event = false

agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.channel = mem1
agent.sinks.hdfs1.hdfs.useLocalTimeStamp = true
agent.sinks.hdfs1.hdfs.path = hdfs:///flume/webdata/avro/year=%Y/month=%m/day=%d
agent.sinks.hdfs1.hdfs.fileType = DataStream
agent.sinks.hdfs1.hdfs.inUsePrefix = .
agent.sinks.hdfs1.hdfs.filePrefix = log
agent.sinks.hdfs1.hdfs.fileSuffix = .s1.avro
agent.sinks.hdfs1.hdfs.batchSize = 10000
agent.sinks.hdfs1.hdfs.rollInterval = 30
agent.sinks.hdfs1.hdfs.rollCount = 0
agent.sinks.hdfs1.hdfs.rollSize = 0
agent.sinks.hdfs1.hdfs.idleTimeout = 60
agent.sinks.hdfs1.hdfs.callTimeout = 25000
agent.sinks.hdfs1.serializer = com.cloudera.flume.demo.CSVAvroSerializer$Builder

agent.sinks.hdfs2.type = hdfs
agent.sinks.hdfs2.channel = mem1
# etc. – sink configs 2 through 4 omitted for brevity
15
Avro schema
16

{
  "type": "record",
  "name": "UserAction",
  "namespace": "com.cloudera.flume.demo",
  "fields": [
    {"name": "txn_id", "type": "long"},
    {"name": "ts", "type": "long"},
    {"name": "action", "type": "string"},
    {"name": "product_id", "type": "long"},
    {"name": "user_ip", "type": "string"},
    {"name": "user_name", "type": "string"},
    {"name": "path", "type": "string"},
    {"name": "referer", "type": "string", "default": ""}
  ]
}
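Because an Avro schema (.avsc) is just JSON, it is worth sanity-checking it parses before uploading it to HDFS as `UserAction.avsc`. A minimal check using only the Python standard library (no Avro library required) might look like this:

```python
import json

# The UserAction schema from the slide above, embedded verbatim.
USER_ACTION_SCHEMA = """
{
  "type": "record",
  "name": "UserAction",
  "namespace": "com.cloudera.flume.demo",
  "fields": [
    {"name": "txn_id", "type": "long"},
    {"name": "ts", "type": "long"},
    {"name": "action", "type": "string"},
    {"name": "product_id", "type": "long"},
    {"name": "user_ip", "type": "string"},
    {"name": "user_name", "type": "string"},
    {"name": "path", "type": "string"},
    {"name": "referer", "type": "string", "default": ""}
  ]
}
"""

schema = json.loads(USER_ACTION_SCHEMA)   # raises ValueError on malformed JSON
field_names = [f["name"] for f in schema["fields"]]
```

The `"default": ""` on `referer` is what lets older records missing that field still deserialize after the schema evolves.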
Custom Event Parser / Avro Serializer
17

/** Converts comma-separated-value (CSV) input text into binary Avro output on HDFS. */
public class CSVAvroSerializer extends AbstractAvroEventSerializer<UserAction> {

  private OutputStream out;

  private CSVAvroSerializer(OutputStream out) {
    this.out = out;
  }

  protected OutputStream getOutputStream() {
    return out;
  }

  protected Schema getSchema() {
    return UserAction.SCHEMA$;
  }

  protected UserAction convert(Event event) {
    String line = new String(event.getBody());
    String[] fields = line.split(",");
    UserAction action = new UserAction();
    action.setTxnId(Long.parseLong(fields[0]));
    action.setTs(Long.parseLong(fields[1]));
    action.setAction(fields[2]);
    action.setProductId(Long.parseLong(fields[3]));
    action.setUserIp(fields[4]);
    action.setUserName(fields[5]);
    action.setPath(fields[6]);
    action.setReferer(fields[7]);
    return action;
  }

  public static class Builder {
    // Builder impl omitted for brevity
  }
}
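The heart of the serializer is `convert()`: split the CSV event body and populate the record fields in schema order. Here is the same parsing logic as a Python sketch (a dict stands in for the generated `UserAction` class; the sample input line is made up for illustration):

```python
def convert(line):
    """Mirror CSVAvroSerializer.convert(): one CSV line -> a UserAction-shaped dict."""
    fields = line.split(",")
    return {
        "txn_id": int(fields[0]),      # long fields are parsed numerically
        "ts": int(fields[1]),
        "action": fields[2],           # string fields pass through as-is
        "product_id": int(fields[3]),
        "user_ip": fields[4],
        "user_name": fields[5],
        "path": fields[6],
        "referer": fields[7],
    }

record = convert("1001,1372244099,view,30111,10.0.0.1,alice,/product/30111,/home")
```

Like the Java version, this assumes exactly eight comma-separated fields per line; a malformed line raises an exception, which in Flume would cause the event's transaction to roll back.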
And now our data flows into Hadoop
18

Send some data to Flume:
./generator.pl | nc localhost 8080

Flume does the rest:
$ hadoop fs -ls /flume/webdata/avro/year=2013/month=06/day=26
Found 16 items
/flume/webdata/avro/year=2013/month=06/day=26/log.1372244099633.s1.avro
/flume/webdata/avro/year=2013/month=06/day=26/log.1372244099633.s2.avro
/flume/webdata/avro/year=2013/month=06/day=26/log.1372244099633.s3.avro
/flume/webdata/avro/year=2013/month=06/day=26/log.1372244099633.s4.avro
/flume/webdata/avro/year=2013/month=06/day=26/log.1372244099634.s1.avro
/flume/webdata/avro/year=2013/month=06/day=26/log.1372244099634.s2.avro
/flume/webdata/avro/year=2013/month=06/day=26/log.1372244099634.s3.avro
/flume/webdata/avro/year=2013/month=06/day=26/log.1372244099634.s4.avro
. . .
Querying our data via SQL: Impala and Hive
19

Shares everything client-facing:
• Metadata (table definitions)
• ODBC/JDBC drivers
• SQL syntax (Hive SQL)
• Flexible file formats
• Machine pool
• Hue GUI

But built for different purposes:
• Hive: runs on MapReduce and is ideal for batch processing
• Impala: native MPP query engine, ideal for interactive SQL

(Diagram: Hive and Impala share the metastore, resource management, and HDFS/HBase storage in formats such as text, RCFile, Parquet, and Avro; Hive pairs its SQL syntax with the MapReduce compute framework for batch processing, while Impala pairs its SQL syntax with its own compute framework for interactive SQL.)
Cloudera Impala
20

Interactive SQL for Hadoop
• Responses in seconds
• Nearly ANSI-92 standard SQL with Hive SQL

Native MPP query engine
• Purpose-built for low-latency queries
• Separate runtime from MapReduce
• Designed as part of the Hadoop ecosystem

Open source
• Apache-licensed
Impala Query Execution
21
(Diagram: an SQL app connects over ODBC to one of several impalad daemons; each impalad runs a query planner, query coordinator, and query executor, colocated with an HDFS DataNode and HBase; the Hive Metastore, HDFS NameNode, and Statestore are shared services.)
1) Request arrives via ODBC/JDBC/Beeswax/Shell
Impala Query Execution
22
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
Impala Query Execution
23
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
  • 24. Parquet File Format: http://parquet.io 24
Open source, columnar file format for Hadoop developed by Cloudera & Twitter:
• Row-group format: a file contains multiple horizontal slices
• Supports storing each column in a separate file
• Supports fully shredded nested data
• Column values stored in native types
• Supports index pages for fast lookup
• Extensible value encodings
  • 25. Table setup (using the Hive shell) 25
CREATE EXTERNAL TABLE webdataIn
PARTITIONED BY (year int, month int, day int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/flume/webdata/avro/'
TBLPROPERTIES ('avro.schema.url' = 'hdfs:///user/mpercy/UserAction.avsc');

/* this must be done for every new day: */
ALTER TABLE webdataIn ADD PARTITION (year = 2013, month = 6, day = 20);
  • 26. Conversion to Parquet 26
Create the archive table from Impala:
impala> CREATE TABLE webdataArchive (
          txn_id bigint, ts bigint, action string, product_id bigint,
          user_ip string, user_name string, path string, referer string)
        PARTITIONED BY (year int, month int, day int)
        STORED AS PARQUETFILE;
  • 27. Conversion to Parquet 27
Load the Parquet table from an Impala SELECT:
ALTER TABLE webdataArchive ADD PARTITION (year = 2013, month = 6, day = 20);

INSERT OVERWRITE TABLE webdataArchive
PARTITION (year = 2013, month = 6, day = 20)
SELECT txn_id, ts, action, product_id, user_ip, user_name, path, referer
FROM webdataIn
WHERE year = 2013 AND month = 6 AND day = 20;
  • 28. Partition update & metadata refresh 28
Impala can't see the Avro data until you add the partition and refresh the metadata for the table:
• Pre-create partitions on a daily basis:
  hive> ALTER TABLE webdataIn ADD PARTITION (year = 2013, month = 6, day = 20);
• Refresh metadata for the incoming table every minute:
  impala> REFRESH webdataIn;
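Both steps are natural cron jobs. A sketch that only prints the day's statements (table name from the slides; piping the output into `hive -e` or `impala-shell -q` is left to the operator, and `%-m`/`%-d` assume GNU date):

```shell
# Emit today's partition DDL and the refresh statement; nothing here
# touches a cluster. Run from cron, e.g. once per day / once per minute.
YEAR=$(date +%Y)
MONTH=$(date +%-m)   # no leading zero, matching the integer partition columns
DAY=$(date +%-d)

echo "ALTER TABLE webdataIn ADD IF NOT EXISTS PARTITION (year = $YEAR, month = $MONTH, day = $DAY);"
echo "REFRESH webdataIn;"
```

Using `ADD IF NOT EXISTS` makes the job safe to re-run, which matters if the cron schedule and the data arrival ever drift.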
  • 29. Querying against both data sets 29
Query against the correct table based on the job schedule:
SELECT txn_id, product_id, user_name, action
FROM webdataIn
WHERE year = 2013 AND month = 6 AND (day = 25 OR day = 26)
  AND product_id = 30111
UNION ALL
SELECT txn_id, product_id, user_name, action
FROM webdataArchive
WHERE NOT (year = 2013 AND month = 6 AND (day = 25 OR day = 26))
  AND product_id = 30111;
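The trick is that the two WHERE clauses are exact complements, so no row is counted twice even for days present in both tables. If the query is generated by a scheduler, building the predicate pair from one definition keeps them in sync. A hedged Python sketch (function and variable names are illustrative, not from the talk):

```python
def window_predicates(year, month, days):
    """Build complementary hot/archive predicates for the given day window."""
    days_clause = " OR ".join(f"day = {d}" for d in sorted(days))
    hot = f"year = {year} AND month = {month} AND ({days_clause})"
    # The archive side is literally the negation, so the halves can't overlap.
    return hot, f"NOT ({hot})"

hot, archive = window_predicates(2013, 6, [25, 26])
query = (
    "SELECT txn_id, product_id, user_name, action FROM webdataIn "
    f"WHERE {hot} AND product_id = 30111 "
    "UNION ALL "
    "SELECT txn_id, product_id, user_name, action FROM webdataArchive "
    f"WHERE {archive} AND product_id = 30111"
)
```

Deriving `archive` from `hot` means moving the window (say, to days 26 and 27 tomorrow) changes one argument instead of two hand-edited clauses.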
  • 30. Query results 30
+---------------+------------+------------+--------+
| txn_id        | product_id | user_name  | action |
+---------------+------------+------------+--------+
| 1372244973556 | 30111      | ylucas     | click  |
| 1372244169096 | 30111      | fdiaz      | click  |
| 1372244563287 | 30111      | vbernard   | view   |
| 1372244268481 | 30111      | ntrevino   | view   |
| 1372244381025 | 30111      | oramsey    | click  |
| 1372244242387 | 30111      | wwallace   | click  |
| 1372244227407 | 30111      | thughes    | click  |
| 1372245609209 | 30111      | gyoder     | view   |
| 1372245717845 | 30111      | yblevins   | view   |
| 1372245733985 | 30111      | moliver    | click  |
| 1372245756852 | 30111      | pjensen    | view   |
| 1372245800509 | 30111      | ucrane     | view   |
| 1372244848845 | 30111      | tryan      | click  |
| 1372245073616 | 30111      | pgilbert   | view   |
| 1372245141668 | 30111      | walexander | click  |
| 1372245362977 | 30111      | agillespie | click  |
| 1372244892540 | 30111      | ypham      | view   |
| 1372245466121 | 30111      | yowens     | buy    |
| 1372245515189 | 30111      | emurray    | view   |
+---------------+------------+------------+--------+
  • 31. Views 31
• It would be much better to define a view to simplify your queries instead of repeating the UNION ALL
• Hive currently supports views
• Impala 1.1 will add VIEW support
  • Available next month (July 2013)
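Once VIEW support lands, the union can be hidden behind a single name. A sketch (the view name is made up, and the date predicates are the same ones used on the previous slide):

```sql
CREATE VIEW webdata AS
SELECT txn_id, ts, action, product_id, user_ip, user_name, path, referer
FROM webdataIn
WHERE year = 2013 AND month = 6 AND (day = 25 OR day = 26)
UNION ALL
SELECT txn_id, ts, action, product_id, user_ip, user_name, path, referer
FROM webdataArchive
WHERE NOT (year = 2013 AND month = 6 AND (day = 25 OR day = 26));
```

Note the caveat: the view must bake in the same complementary predicates, so it needs to be redefined as the hot window moves, otherwise days present in both tables would be returned twice.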
  • 32. Other considerations 32
• Don't forget to delete the small files once they have been compacted
  • Keep your incoming data table to a minimum; REFRESH <table> is expensive
• Flume may write duplicates
  • Your webdataIn table may return duplicate entries
  • Consider doing a de-duplication job via M/R before the INSERT OVERWRITE
  • Your schema should have keys to de-dup on
  • DISTINCT might help with both of these issues
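The de-dup step the slide describes amounts to keeping one record per key before compaction. A minimal Python sketch of the logic, assuming `txn_id` is the de-dup key (per the slide, the schema should carry such a key; in practice this would run as an M/R job or an Impala `DISTINCT` over the partition, not in-memory Python):

```python
def dedup(records, key="txn_id"):
    """Keep the first record seen for each key value; drop later repeats."""
    seen = set()
    out = []
    for rec in records:
        k = rec[key]
        if k not in seen:
            seen.add(k)
            out.append(rec)
    return out

# Illustrative events: Flume's at-least-once delivery can replay an event
# after a retry, producing an exact duplicate.
events = [
    {"txn_id": 1372244973556, "action": "click"},
    {"txn_id": 1372244973556, "action": "click"},  # duplicate from a retry
    {"txn_id": 1372244169096, "action": "view"},
]
unique = dedup(events)
```

Keeping the first occurrence is safe here because retried duplicates are byte-identical copies of the original event.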
  • 33. Even lower latency via Flume & HBase 33
• Flume has 2 HBase sinks:
  • ASYNC_HBASE is faster, but doesn't support Kerberos or HBase 0.96 yet
  • HBASE is slower, but uses the stock libs and supports everything
• Take care of de-dup by using idempotent operations
  • Write to unique keys
• The stream + batch pattern is useful here if you need counters
  • Approximate counts with real-time increment ops
  • Go back and reconcile the streaming numbers with batch-generated accurate numbers, excluding any duplicates created by retries
  • 34. Flume ingestion strategies 34
• In this example, we used "netcat" because it's easy
• For a system that needs reliable delivery, we should use something else:
  • Flume RPC client API
  • Thrift RPC
  • HTTP source (REST)
  • Spooling Directory source
• Memory channel vs. file channel
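Of the reliable options above, the HTTP source is the easiest to sketch: it accepts a JSON array of events, each with a `headers` map and a `body` string. A hedged example that only prints the request (host, port, and field values are illustrative; nothing is sent anywhere):

```shell
# Payload format for Flume's HTTP source with the default JSON handler:
# a JSON array of events, each carrying "headers" and "body".
PAYLOAD='[{"headers": {"host": "web01"}, "body": "1372244973556 click 30111"}]'

# Print the curl command an edge client would run (host/port made up):
echo "curl -X POST -H 'Content-Type: application/json' \\"
echo "     -d '$PAYLOAD' http://flume-agent:8081"
```

Unlike netcat, the HTTP source returns a status code per request, so the client knows whether the batch was accepted and can retry on failure.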
  • 35. For reference 35
Flume docs:
• User Guide: flume.apache.org/FlumeUserGuide.html
• Dev Guide: flume.apache.org/FlumeDeveloperGuide.html
Impala tutorial:
• www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_tutorial.html
  • 36. Flume 1.4.0 36
• A release candidate is out for version 1.4.0
• Adds:
  • SSL connection security for the Avro source
  • Thrift source
  • SolrCloud sink
  • Flume embedding support (embedded agent)
  • File channel performance improvements
  • Easier plugin integration
  • Many other improvements & bug fixes
• Try it: http://people.apache.org/~mpercy/flume/apache-flume-1.4.0-RC1/
  • 37. Flume: How to get involved! 37
• Join the mailing lists:
  • user-subscribe@flume.apache.org
  • dev-subscribe@flume.apache.org
• Look at the code:
  • github.com/apache/flume (mirror of the flume.apache.org git repo)
• File a bug (or fix one!):
  • issues.apache.org/jira/browse/FLUME
• More on how to contribute:
  • cwiki.apache.org/confluence/display/FLUME/How+to+Contribute
  • 38. Thank You! 38
Mike Percy
Work: Software Engineer, Cloudera
Apache: PMC Member & Committer, Apache Flume
Twitter: @mike_percy
Code: github.com/mpercy/flume-rtq-hadoop-summit-2013

Editor's notes

  1. Four tier-1 agents drain into one tier-2 agent, which then distributes its load over two tier-3 agents. You can have a single config file for all three tiers of agents, pass it around your deployment, and you're done. At any node, the ingest rate must equal the exit rate.
  2. The file channel is the recommended channel: it is reliable (no data loss in an outage), scales linearly with additional spindles (more disks, better performance), and offers better durability guarantees than the memory channel. The memory channel can't scale to large capacity because it is bounded by memory. JDBC is not recommended due to slow performance (don't mention deadlock).
  3. The file channel is the recommended channel: it is reliable (no data loss in an outage), scales linearly with additional spindles (more disks, better performance), and offers better durability guarantees than the memory channel. The memory channel can't scale to large capacity because it is bounded by memory. JDBC is not recommended due to slow performance (don't mention deadlock).
  4. Avro is standard. Channels support transactions. Flume sources: Avro, exec, syslog, spooling directory, HTTP, embedded agent, JMS.
  5. Polling semantics: the sink continually polls to see if events are available. The AsyncHBase sink is recommended over the HBase sink (synchronous HBase API) for better performance. The null sink drops events on the floor.
  6. In the upper diagram, the three agents' flow is healthy. In the lower diagram, a sink fails to communicate with its downstream source, so its reservoir fills up, and that filling cascades upstream, buffering against the downstream hardware failure. No events are lost until all channels in that flow fill up, at which point the sources report failure to the client. Steady-state flow is restored when the link becomes active again.
  7. Interactive SQL for Hadoop: responses in seconds vs. minutes or hours; 4-100x faster than Hive. Nearly ANSI-92 standard SQL with HiveQL: CREATE, ALTER, SELECT, INSERT, JOIN, subqueries, etc.; ODBC/JDBC drivers; a compatible SQL interface for existing Hadoop/CDH applications. Native MPP query engine: purpose-built for low-latency queries, another application being brought to Hadoop; a separate runtime from MapReduce, which is designed for batch processing; tightly integrated with the Hadoop ecosystem, a major design imperative and differentiator for Cloudera: a single system (no integration), native open file formats compatible across the ecosystem (no copying), a single metadata model (no synchronization), a single set of hardware and system resources (better performance, lower cost), and integrated end-to-end security (no vulnerabilities). Open source: keeps with our strategy of an open platform, i.e., if it stores or processes data, it's open source; Apache-licensed; code available on GitHub.