Apache Flume and its use case in Manufacturing

3,407 views

Published

CSCI E90 Cloud Computing
Harvard University Extension School

Published in: Education, Technology
  1. Final Project: Apache Flume
     Rapheephan Thongkham-Uan (Nancy), CSCI E90 Cloud Computing, Harvard University Extension School, Prof. Zoran B. Djordjević (@TakeshiDemonkey)
  2. What is Apache Flume?
     ▪ Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralised data store (http://flume.apache.org/FlumeUserGuide.html).
     ▪ Currently available versions are 0.9.x and 1.x.
     ▪ I want to focus on Flume use cases in manufacturing. @Rapheephan
  3. Applying Flume to the Manufacturing Process
     ▪ In the factory, many machines are used in production.
     ▪ If each machine produces one log data file when one lot of product finishes processing, then in one day a large amount of log data will be stored on the server.
     ▪ For quality control and production-control improvement, our objective is to analyse these log files in real time.
     ▪ First, we need to collect these log data files from the production lines into HDFS, then pass them through the analysis process.
  4. Multi-agent flow image in the production system
     [Diagram: AGENT 1, AGENT 2, and AGENT 3 send events to AGENT 4 (consolidation), which writes to HDFS.]
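The slides don't show the configuration for this consolidated topology. In Flume NG, the usual pattern is an Avro sink on each upstream agent pointed at an Avro source on the consolidating agent. A hypothetical sketch (the agent names, consolidator host, port 4545, and channel names are assumptions, not taken from the deck):

```properties
# On each of agent1..agent3: forward events to the consolidating agent
agent1.sinks.avro-forward.type = avro
agent1.sinks.avro-forward.hostname = <consolidator-host>
agent1.sinks.avro-forward.port = 4545
agent1.sinks.avro-forward.channel = memoryChannel

# On agent4 (consolidation): receive events from the upstream agents
agent4.sources.avro-collect.type = avro
agent4.sources.avro-collect.bind = 0.0.0.0
agent4.sources.avro-collect.port = 4545
agent4.sources.avro-collect.channels = memoryChannel
```

Agent 4 would then attach an HDFS sink, as in the single-agent sample on the following slides.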
  5. My Sample
     [Diagram: agent 1 (SOURCE → CHANNEL → SINK) → HDFS]
     ▪ My system
       ▪ Java Runtime Environment (Java 1.6.0_31)
       ▪ Cloudera's Distribution Including Apache Hadoop (CDH4.3)
     ▪ Working steps
       1. Install Apache Flume on the host machine (Flume installation guide for CDH4: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_12.html)
       2. Create 2 log-generation Java applications, for machine1 and machine2
       3. Configure the Flume agent
       4. Start the Flume agent and test the system
  6. Prepare the log-generation application
     ▪ Create 2 virtual machines for generating machine1's and machine2's log data.
     ▪ Create a simple Java socket program that produces log events to the agent's source on a specific port (11111).
     ▪ Export it as an executable JAR file, and move it to virtual machine1.
     ▪ Copy and move the other to virtual machine2.
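The deck doesn't include the source of genLog.jar; the following is a minimal sketch of what such a generator might look like, given the netcat source on port 11111 and the log-line format shown on slide 9. The class name, event count, and 2-second delay are assumptions.

```java
import java.io.PrintWriter;
import java.net.Socket;
import java.text.SimpleDateFormat;
import java.util.Date;

public class LogGenerator {

    // Build one log line in the format shown on slide 9,
    // e.g. "2013-12-17 14:32:19: This is a sample log file from machine 1."
    static String formatLogLine(int machineId, Date ts) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        return fmt.format(ts) + ": This is a sample log file from machine " + machineId + ".";
    }

    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost"; // agent host (assumed default)
        int machineId = args.length > 1 ? Integer.parseInt(args[1]) : 1;
        // Connect to the agent's netcat source and send newline-terminated events;
        // the netcat source treats each line as one Flume event.
        Socket socket = new Socket(host, 11111);
        PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
        try {
            for (int i = 0; i < 10; i++) { // emit a few sample events a few seconds apart
                out.println(formatLogLine(machineId, new Date()));
                Thread.sleep(2000);
            }
        } finally {
            out.close();
            socket.close();
        }
    }
}
```

Running `java -jar genLog.jar <agent-host> 1` on machine1 and `... 2` on machine2 would then produce the two interleaved event streams seen on slide 9.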
  7. Configuring the Flume-ng agent on the host
     ▪ We have to configure every sink, channel, and source in the flow. My agent name is hdfs-agent.
     ▪ First, name the components in the agent:
       hdfs-agent.sources = log-collect
       hdfs-agent.channels = memoryChannel
       hdfs-agent.sinks = hdfs-write
     ▪ Next, define the source's properties as follows:
       hdfs-agent.sources.log-collect.type = netcat
       hdfs-agent.sources.log-collect.bind = 133.196.211.209
       hdfs-agent.sources.log-collect.port = 11111
       hdfs-agent.sources.log-collect.channels = memoryChannel
     ▪ My source is a netcat-like source that listens on port 11111.
     ▪ Don't forget to define the channel used by the source.
  8. Configuring the Flume-ng agent on the host (2)
     ▪ We want to collect the log data and write it to the 'testFlume' directory on the HDFS cluster. Therefore, the sink should be defined as follows:
       hdfs-agent.sinks.hdfs-write.type = hdfs
       hdfs-agent.sinks.hdfs-write.hdfs.path = hdfs://<namenode>/user/<myusername>/testflume
       hdfs-agent.sinks.hdfs-write.hdfs.writeFormat = Text
       hdfs-agent.sinks.hdfs-write.hdfs.fileType = DataStream
       hdfs-agent.sinks.hdfs-write.channel = memoryChannel
     ▪ Don't forget to specify the channel used by the sink.
     ▪ Finally, configure the channel:
       hdfs-agent.channels.memoryChannel.type = memory
       hdfs-agent.channels.memoryChannel.capacity = 1000
     ▪ The channel will store the log data in memory, with a maximum of 1000 events.
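Putting slides 7 and 8 together, the complete flume.conf should look like the following (the bind address and the `<namenode>`/`<myusername>` placeholders are carried over from the slides as-is):

```properties
# flume.conf — single-agent flow: netcat source -> memory channel -> HDFS sink
hdfs-agent.sources = log-collect
hdfs-agent.channels = memoryChannel
hdfs-agent.sinks = hdfs-write

hdfs-agent.sources.log-collect.type = netcat
hdfs-agent.sources.log-collect.bind = 133.196.211.209
hdfs-agent.sources.log-collect.port = 11111
hdfs-agent.sources.log-collect.channels = memoryChannel

hdfs-agent.sinks.hdfs-write.type = hdfs
hdfs-agent.sinks.hdfs-write.hdfs.path = hdfs://<namenode>/user/<myusername>/testflume
hdfs-agent.sinks.hdfs-write.hdfs.writeFormat = Text
hdfs-agent.sinks.hdfs-write.hdfs.fileType = DataStream
hdfs-agent.sinks.hdfs-write.channel = memoryChannel

hdfs-agent.channels.memoryChannel.type = memory
hdfs-agent.channels.memoryChannel.capacity = 1000
```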
  9. Start the Flume agent and get the result
     ▪ My configuration file name is 'flume.conf', and my agent name is 'hdfs-agent'.
     ▪ Start the Flume agent using the following command:
       $ flume-ng agent --conf-file flume.conf --name hdfs-agent
     ▪ Execute genLog.jar on both machines.
     ▪ On the Flume master, you will be able to see something like this:
       13/12/17 14:36:13 INFO hdfs.BucketWriter: Creating hdfs://<namenode>:8020/user/<my userid>/testflume/FlumeData.1387258572230.tmp
       13/12/17 14:36:19 INFO hdfs.BucketWriter: Renaming hdfs://cmc-cldULL6400.toshiba.co.jp:8020/user/g0092010/testflume/FlumeData.1387258572230.tmp to hdfs://<namenode>:8020/user/<my userid>/testflume/FlumeData.1387258572230
     ▪ Verify that the log data has been stored as events on HDFS:
       g0092010@cmc-cldULL6400:~$ hadoop fs -cat testflume/*30
       2013-12-17 14:32:19: This is a sample log file from machine 1.
       2013-12-17 14:32:24: This is a sample log file from machine 1.
       2013-12-17 14:32:27: This is a sample log file from machine 2.
       2013-12-17 14:32:29: This is a sample log file from machine 1.
  10. Next steps
      ▪ Analyse the log data and visualise it in (near) real time.
      [Diagram: AGENT 1, AGENT 2, and AGENT 3 → AGENT 4 → HDFS, feeding MapReduce, Hive, Mahout, Impala, and visualisation tools.]
      ▪ Improve the throughput of the system.
      ▪ Analyse and predict future trends.
      ▪ etc.