Get acquainted with Apache Flume, a distributed, reliable tool/service for collecting large amounts of streaming data into centralized storage, along with its architecture.
2. Flume
Flume is a distributed, reliable tool/service for collecting large
amounts of streaming data into centralized storage.
In simple terms, Flume is helpful when we need to load/collect
data continuously in real time, unlike a traditional RDBMS that
loads data in batches periodically.
Rupak Roy
3. One of Flume's biggest advantages is that when the rate of incoming
data exceeds the rate at which data can be stored at its
destination, Flume acts as a medium or middleman
between the data source and the data storage, providing a steady
flow of data between them.
Example: a log file is a file/record that consists of events
of system operations; for instance, a piece of software writes a log entry
whenever there is a failure in its operations. By analyzing such
data one can figure out the behavior of the software and locate
its failures.
Whenever we transfer data to HDFS using the -put or
-copyFromLocal command, we can transfer only one file at a
time. To overcome this limitation, Flume was created to transfer
streaming data without delay.
Another advantage of Flume is the reliability of transferring
data to HDFS. During a plain file transfer to HDFS, the size of
the file remains zero until the transfer is finished, so if there is a
network issue or power failure in the middle of the transfer, the data
in HDFS will be lost; Flume avoids this by delivering the data as a
stream of small, individually committed events.
4. Apache Flume - Architecture
Source: a source extracts the data from the clients and
transfers it to one or more channels of the Flume agent.
Source types include: avro, netcat, seq, exec,
syslogudp, http, twitter, etc.
Channel: it acts as a mediator between the source and
the sink. It temporarily stores the data from the source
and buffers it until it is consumed by the sinks.
Channels can be of many types, such as Memory Channel,
File Channel, JDBC Channel, custom channels, etc.
Sink: a sink consumes the data from the channel and
transfers it to centralized storage such as HBase
or HDFS. Some of the sink types are logger, avro,
hdfs, irc, file_roll, etc.
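As a rough sketch, the three components above are wired together in an agent's properties file like this (the agent and component names `a1`, `r1`, `c1`, `k1` are made-up placeholders, assuming a netcat source, a memory channel and a logger sink):

```properties
# Name the agent's components
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for plain-text lines on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in RAM
a1.channels.c1.type = memory

# Sink: write events to the agent's log
a1.sinks.k1.type = logger

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Note that a source can feed several channels (`channels`, plural), while a sink drains exactly one (`channel`, singular).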
5. Agent: an agent is an independent daemon process. It is a
collection of sources, channels and sinks that receives
data from clients or from other Flume agents
and transfers it to its destination.
A Flume agent can have multiple sources, sinks
and channels.
Flume Architecture
6. Channel Types:
Memory Channels
These are volatile, RAM-based channels, so they limit Flume's
operation to the available RAM. Whenever there is an
interruption due to a power failure or network issue, any data
not yet transferred will be lost.
However, volatile memory provides one universal advantage:
SPEED. Memory channels are faster
than file-based channels.
File Channels: these are robust channels that use disk instead of
RAM to store events. A file channel is a bit slower than a memory
channel, but comes with another solid advantage: the
events (the data) will not be lost even if Flume's
operation is interrupted by a power failure
or network issue.
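As a sketch, the two channel types differ only in a few properties (the agent/channel names and directory paths below are made-up examples; `capacity`, `transactionCapacity`, `checkpointDir` and `dataDirs` are standard Flume channel properties):

```properties
# Memory channel: fast, but buffered events are lost on a crash
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100

# File channel: slower, but events survive restarts
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /var/flume/checkpoint
a1.channels.c2.dataDirs = /var/flume/data
```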
7. Configuration
First enter the Flume conf folder: cd conf
then: ls
then: vi flumepractice.conf, press i (insert mode) and type:
#Name the components
test.sources = ts1
test.sinks = tk1
test.channels = tc1
#Describe/Configure the source
test.sources.ts1.type = exec
test.sources.ts1.command = tail -F /home/cloudera/hadoop/logs/ ………..
#Describe the sink
test.sinks.tk1.type = hdfs
test.sinks.tk1.hdfs.path = hdfs://localhost:9001/flume
#Describe the channel
test.channels.tc1.type = memory
8. Configuration (continued)
#Join/Bind the source and sink to the channel
#Join the source to the channel
test.sources.ts1.channels = tc1
#Join the sink to the channel
test.sinks.tk1.channel = tc1
Press 'esc' and then type ':wq!' to save and exit.
Then run the following command to start the Flume job:
Flume$ bin/flume-ng agent -n test -f conf/flumepractice.conf
where
flume-ng is the executable file of Flume,
agent specifies that a Flume agent is to be executed,
-n gives the name of the agent as mentioned in the
configuration file,
-f specifies the path of the configuration file.
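A slightly fuller launch command, as a sketch (`--conf`/`--conf-file`/`--name` are the long forms of the flags above, and `-Dflume.root.logger` is a standard log override; the paths assume the layout used in this example and require a working Flume/Hadoop installation):

```shell
# Run the agent named "test" with its configuration file,
# pointing Flume at its conf directory and logging to the console
bin/flume-ng agent \
  --conf conf \
  --conf-file conf/flumepractice.conf \
  --name test \
  -Dflume.root.logger=INFO,console
```

Once the agent is running, new lines appended to the tailed log file should start appearing under the hdfs://localhost:9001/flume path configured for the sink.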
9. Next
Sqoop: to transfer bulk data between
HDFS and structured databases