1. Ingest and Indexing of RSS News Feeds in the Hadoop Environment
Stephanie F. Guadagno
January 2014
2. Introduction
• Work was done on a virtual machine loaded with Cloudera's CDH 4.3.
• Used Flume 1.3, Cloudera's Morphlines, Cloudera Search with Solr 4.3, and Hadoop 2.0.
• Used Flume to pull RSS news feeds from BBC World News into HDFS.
• The news data in HDFS was indexed and loaded into Solr using the MapReduceIndexerTool and Cloudera's Morphlines framework.
3. Overview of Components Used
• Flume is used to reliably ingest large amounts of data from various
sources (e.g. log files, web sites, social media sites) into a
centralized or distributed data store, such as HBase or HDFS.
• MapReduceIndexerTool is a MapReduce batch job driver used with
Cloudera Search. The tool is used to index a set of input files and
then write the indexes into HDFS. The GoLive feature will merge
the output shards into a set of live Solr servers (e.g. a SolrCloud).
• Cloudera Morphlines is a new open source framework that
facilitates simple ETL of ingested data into Apache Solr. The
framework consists of the new Morphlines library and
specifications for creating a “morphline”, which encompasses a
chain of transformation commands.
• Cloudera Search facilitates Big Data search by bringing search and
scalable indexing from Solr 4.X into the Hadoop ecosystem.
4. Flume’s Data Flow
• A Flume Agent is a Java process that hosts the Flume Source, Channel, and Sink components through which events flow from an external source to the next destination.
• An event is a unit of data that flows through the components.
• A Flume Source listens for events and writes each event to the Channel.
• The Channel queues the events as transactions.
• The Flume Sink writes each event to the external destination (e.g. HDFS, HBase, Solr, or a file) and removes it from the queue.
[Diagram: an external source (e.g. social media, log files, web pages, an RSS news feed) emits data in a format recognized by the Flume Source. Inside the Agent, events flow from the Source (e.g. Avro, Exec, HTTP, JMS, Syslog) through the Channel (e.g. Memory, File, JDBC) to the Sink (e.g. File, HDFS, HBase, Morphline Solr Sink), which writes them to a file, HBase, HDFS, or Solr in a format specified by the Flume Sink.]
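To make the event abstraction concrete, here is a minimal Java sketch of building a Flume event with EventBuilder. The header keys and payload are illustrative (not taken from this deck); the Flume calls are the standard Flume NG API.

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

// A Flume event is a byte-array body plus optional string headers.
public class EventDemo {
  public static void main(String[] args) {
    Map<String, String> headers = new HashMap<String, String>();
    headers.put("feed", "bbc-world");  // hypothetical header for illustration
    headers.put("timestamp", Long.toString(System.currentTimeMillis()));

    Event event = EventBuilder.withBody(
        "<item><title>...</title></item>".getBytes(StandardCharsets.UTF_8),
        headers);

    // A Source hands such events to its Channel; a Sink later drains them.
    System.out.println(new String(event.getBody(), StandardCharsets.UTF_8));
  }
}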
5. Morphline Data Flow
• Cloudera's Morphlines is a Java library that was developed as part of Cloudera Search.
• The library contains a suite of frequently used transformation and I/O "command" classes for simple ETL on data flowing into Solr.
• The library can be integrated into Flume for near-real-time ETL or into MapReduce for batch ETL.
• For batch ETL, Cloudera provides the MapReduceIndexerTool for data in HDFS. For data in HBase, the tool is the HBaseMapReduceIndexerTool.
• A morphline consumes input records and turns them into a stream of records that is piped through a chain of transformation commands.
[Diagram: in batch mode, records flow from HDFS or HBase through the morphline's chain of commands (cmd → … → cmd) into Solr; in near-real-time (NRT) mode, records flow from a Flume Source through the same command chain into Solr.]
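For context on how a compiled morphline is driven from Java, here is a hedged sketch of the Morphlines API, shown with the later Kite SDK package names (the CDH 4.3-era equivalents lived under com.cloudera.cdk.morphline). The file name and morphline id match slide 11; reading the input from args[0] is illustrative.

import java.io.File;
import java.io.FileInputStream;

import org.kitesdk.morphline.api.Command;
import org.kitesdk.morphline.api.MorphlineContext;
import org.kitesdk.morphline.api.Record;
import org.kitesdk.morphline.base.Compiler;
import org.kitesdk.morphline.base.Fields;
import org.kitesdk.morphline.base.Notifications;

public class MorphlineDemo {
  public static void main(String[] args) throws Exception {
    // Compile the morphline once; it is reused for every record.
    MorphlineContext context = new MorphlineContext.Builder().build();
    Command morphline = new Compiler().compile(
        new File("morphline.conf"), "morphlineNewsFeed", context, null);

    // Attach the raw bytes; readAvro (the first command) consumes this field.
    Record record = new Record();
    record.put(Fields.ATTACHMENT_BODY, new FileInputStream(args[0]));

    Notifications.notifyStartSession(morphline);
    boolean success = morphline.process(record); // runs the whole command chain
    System.out.println(success ? "record processed" : "record failed");
  }
}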
6. News Feed ETL Data Flow
1) Ingest using Flume
2) Index using MapReduce and Morphline
[Diagram: BBC RSS news feeds (us, uk, asia, etc.) enter the Flume Agent ("agent") through the Custom Source, pass through the Memory Channel as Avro JSON record(s), and are written by the HDFS Sink into HDFS ("newsfeeds/"). The MapReduceIndexerTool (org.apache.solr.hadoop.MapReduceIndexerTool) then runs MapReduce, driven by the Morphline configuration file, over the Avro JSON records and pushes the index to Solr Cloud ("news_feeds").]
• Implemented a custom Flume event-driven Source to pull RSS news feeds from BBC World News (a hedged sketch follows at the end of this slide). Details:
  – Must implement Flume's EventDrivenSource interface
  – Parsed the news feed items
  – Wrote each item to the Channel in Avro JSON format
• Ensured the Agent was defined. CDH4 came with an agent called "tier1"; I created an agent called "agent".
• Configured the data flow in a Flume agent configuration file.
• Wrote a script that runs the Flume agent with the agent configuration file.
On the Solr side (Solr Cloud, "news_feeds"):
• Created a Solr instance for the "news_feeds" collection with a modified schema for the fields in the news feed data.
• Created the "news_feeds" collection with 1 shard.
• Wrote the Morphline file.
• Wrote a script that runs the MapReduceIndexerTool with the Morphline specification file.
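A hedged skeleton of the custom event-driven source described above. The class name, the "feedUrl" property, the 15-minute polling interval, and the fetchFeedItemsAsAvroJson helper are illustrative stand-ins for the author's actual parsing code; the Flume interfaces and calls are the real Flume 1.x API.

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.flume.Context;
import org.apache.flume.EventDrivenSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;

public class RssFeedSource extends AbstractSource
    implements EventDrivenSource, Configurable {

  private String feedUrl;
  private ScheduledExecutorService executor;

  @Override
  public void configure(Context context) {
    // e.g. agent.sources.rss.feedUrl = http://feeds.bbci.co.uk/news/world/rss.xml
    feedUrl = context.getString("feedUrl");
  }

  @Override
  public synchronized void start() {
    super.start();
    executor = Executors.newSingleThreadScheduledExecutor();
    executor.scheduleAtFixedRate(new Runnable() {
      public void run() {
        // Write each parsed feed item to the Channel as an Avro JSON event.
        for (String json : fetchFeedItemsAsAvroJson(feedUrl)) {
          getChannelProcessor().processEvent(
              EventBuilder.withBody(json.getBytes(StandardCharsets.UTF_8)));
        }
      }
    }, 0, 15, TimeUnit.MINUTES);
  }

  @Override
  public synchronized void stop() {
    executor.shutdown();
    super.stop();
  }

  // Stand-in: fetch the feed, parse the <item> elements, and serialize
  // each one as an Avro JSON record matching NewsRecord.avsc.
  private List<String> fetchFeedItemsAsAvroJson(String url) {
    throw new UnsupportedOperationException("feed parsing omitted");
  }
}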
10. News Feed Data Indexing Details - 1 of 3 (MapReduce)
[Diagram: the MapReduceIndexerTool (org.apache.solr.hadoop.MapReduceIndexerTool) reads Avro JSON record(s) from HDFS ("newsfeeds/"), runs MapReduce driven by the Morphline configuration file, and pushes the resulting index to Solr Cloud ("news_feeds").]
Two tools being used:
1. HdfsFindTool: used to get the most recent files changed within the past day.
2. MapReduceIndexerTool: runs a MapReduce job to index the HDFS input files and push the index to Solr.
# Go-live merges the output shards of the previous phase into a
# set of on-line Solr servers.
#
echo "Running go-live mode"
hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.HdfsFindTool \
  -find hdfs:///${HDFS_INDIR} \
  -type f \
  -name 'in*' \
  -mtime -1 | \
sudo -u hdfs hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 \
  jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --libjars /usr/lib/solr/contrib/mr/search-mr-0.9.1-cdh4.3.0-SNAPSHOT.jar \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --log4j ${LOG_FILE} ${DRYRUN} \
  --morphline-file ${MORPHLINE_FILE} \
  --update-conflict-resolver org.apache.solr.hadoop.dedup.RetainMostRecentUpdateConflictResolver \
  ${REDUCERS_ARG} \
  --verbose \
  --output-dir hdfs://localhost:8020/${HDFS_SOLR_IDXDIR} \
  --go-live \
  --zk-host localhost:2181/solr \
  --collection ${COLLECTION} \
  --input-list -

echo "Clean-up tmp directory"
sudo -u hdfs rm /tmp/solr*.zip
echo "Done."
11. News Feed Data Indexing Details - 2 of 3 (Morphline)
The morphline pipes each record through the command chain "readAvro" → "extractAvroPaths" → "convertTimestamp" → "sanitizeUnknownSolrFields" → "loadSolr", and the resulting document is loaded into Solr Cloud ("news_feeds"). Assembled from the slide, the configuration file is:

SOLR_LOCATOR : {
  # specify collection and zkHost
}

morphlines : [
  {
    id : morphlineNewsFeed
    importCommands : ["com.cloudera.**"]
    commands : [
      { readAvro {
        isJson : true
        writerSchemaFile : /home/dataingest/schema/NewsRecord.avsc
      }}

      { extractAvroPaths {
        flatten : false
        paths : {
          id : /id
          title : /Title
          url : /Link
          published_date : /Publish_Date
          author : /Author
          comments : /Comments
          description : /Description
        }
      }}

      # Convert published_date to native Solr timestamp format
      { convertTimestamp {
        field : published_date
        inputFormats : ["EEE, d MMM yyyy HH:mm:ss z",
                        "EEE, dd MMM yyyy HH:mm:ss z"]
        inputTimezone : GMT
        outputTimezone : US/Eastern
      }}

      # Solr will throw an exception on any attempt to load
      # a document containing a field not specified in schema.xml.
      { sanitizeUnknownSolrFields {
        # Location from which to fetch the Solr schema
        solrLocator : ${SOLR_LOCATOR}
      }}

      { loadSolr {
        solrLocator : ${SOLR_LOCATOR}
      }}
    ]
  }
]

Tid-bits
• HOCON (Human-Optimized Config Object Notation) is the JSON-like format in which morphlines are written.
• A morphline is defined as a tree of commands; the output of one command is sent to the next command.
• The morphline is compiled at run-time.
12. News Feed Data Indexing Details - 3 of 3 (Solr Collection)
• The “news_feeds” Solr collection presently contains 3800 documents
in the index.
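To sanity-check the loaded collection, here is a minimal SolrJ 4.x sketch, assuming the same localhost ZooKeeper address used with --zk-host earlier; the match-all query and row count are illustrative.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class NewsFeedsQuery {
  public static void main(String[] args) throws Exception {
    // Connect to SolrCloud through ZooKeeper and target the collection.
    CloudSolrServer server = new CloudSolrServer("localhost:2181/solr");
    server.setDefaultCollection("news_feeds");

    SolrQuery query = new SolrQuery("*:*"); // match all documents
    query.setRows(5);                       // fetch a small sample
    QueryResponse response = server.query(query);

    System.out.println("Documents in index: "
        + response.getResults().getNumFound());
    server.shutdown();
  }
}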
14. Summary
• Presented ingest of RSS News Feeds using
– Flume with a Custom Source, Memory Channel,
and HDFS Sink
• Presented indexing of the news feed data in HDFS using
– Cloudera’s Morphlines library and “morphline”
configuration
– Cloudera’s MapReduceIndexerTool
– Cloudera Search with Solr 4.X