1. Ingest and Indexing of RSS News Feeds in the Hadoop Environment
Stephanie F. Guadagno
January 2014
2. Introduction
• Work was done on a virtual machine loaded with Cloudera's CDH 4.3.
• Used Flume 1.3, Cloudera's Morphlines, Cloudera Search with Solr 4.3, and Hadoop 2.0.
• Used Flume to pull RSS news feeds from BBC World News into HDFS.
• The news data in HDFS was indexed and loaded into Solr using the MapReduceIndexerTool and Cloudera's Morphlines framework.
3. Overview of Components Used
• Flume is used to reliably ingest large amounts of data from various
sources (e.g. log files, web sites, social media sites) into a
centralized or distributed data store, such as HBase or HDFS.
• MapReduceIndexerTool is a MapReduce batch job driver used with
Cloudera Search. The tool is used to index a set of input files and
then write the indexes into HDFS. The GoLive feature will merge
the output shards into a set of live Solr servers (e.g. a SolrCloud).
• Cloudera Morphlines is a new open source framework that
facilitates simple ETL of ingested data into Apache Solr. The
framework consists of the new Morphlines library and
specifications for creating a “morphline”, which encompasses a
chain of transformation commands.
• Cloudera Search facilitates Big Data search by bringing search and
scalable indexing from Solr 4.X into the Hadoop ecosystem.
4. Flume’s Data Flow
• A Flume Agent is a Java process that hosts the Flume Source, Channel, and Sink components through which events flow from an external source to the next destination.
• An event is a unit of data that flows through the components.
• A Flume Source listens for events and writes each event to the Channel.
• The Channel queues the events as transactions.
• The Flume Sink writes each event to the external destination (e.g. HDFS, HBase, Solr, or a file) and removes it from the queue.
[Diagram: an external source (e.g. social media, log files, web pages, an RSS news feed) emits data in a format recognized by the Flume Source. Inside the Agent, events flow from the Source (e.g. Avro, Exec, HTTP, JMS, Syslog) through the Channel (e.g. Memory, File, JDBC) to the Sink (e.g. File, HDFS, HBase, Morphline Solr Sink), which writes them to a file, HBase, HDFS, or Solr in a format specified by the Flume Sink.]
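To make the event abstraction concrete, here is a minimal Java sketch of building a Flume event with EventBuilder. The header keys and payload are illustrative (not taken from this deck); the Flume calls are the standard Flume NG API.

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

// A Flume event is a byte-array body plus optional string headers.
public class EventDemo {
  public static void main(String[] args) {
    Map<String, String> headers = new HashMap<String, String>();
    headers.put("feed", "bbc-world");  // hypothetical header for illustration
    headers.put("timestamp", Long.toString(System.currentTimeMillis()));

    Event event = EventBuilder.withBody(
        "<item><title>...</title></item>".getBytes(StandardCharsets.UTF_8),
        headers);

    // A Source hands such events to its Channel; a Sink later drains them.
    System.out.println(new String(event.getBody(), StandardCharsets.UTF_8));
  }
}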
5. Morphline Data Flow
• Cloudera's Morphlines is a Java library that was developed as part of Cloudera Search.
• The library contains a suite of frequently used transformation and I/O "command" classes for simple ETL on data flowing into Solr.
• The library can be integrated into Flume for near-real-time ETL or into MapReduce for batch ETL.
• For batch ETL, Cloudera provides the MapReduceIndexerTool for data in HDFS. For data in HBase, the tool is the HBaseMapReduceIndexerTool.
• A morphline consumes input records and turns them into a stream of records that is piped through a chain of transformation commands.
[Diagram: in batch mode, records flow from HDFS or HBase through the morphline's chain of commands (cmd → … → cmd) into Solr; in near-real-time (NRT) mode, records flow from a Flume Source through the same command chain into Solr.]
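For context on how a compiled morphline is driven from Java, here is a hedged sketch of the Morphlines API, shown with the later Kite SDK package names (the CDH 4.3-era equivalents lived under com.cloudera.cdk.morphline). The file name and morphline id match slide 11; reading the input from args[0] is illustrative.

import java.io.File;
import java.io.FileInputStream;

import org.kitesdk.morphline.api.Command;
import org.kitesdk.morphline.api.MorphlineContext;
import org.kitesdk.morphline.api.Record;
import org.kitesdk.morphline.base.Compiler;
import org.kitesdk.morphline.base.Fields;
import org.kitesdk.morphline.base.Notifications;

public class MorphlineDemo {
  public static void main(String[] args) throws Exception {
    // Compile the morphline once; it is reused for every record.
    MorphlineContext context = new MorphlineContext.Builder().build();
    Command morphline = new Compiler().compile(
        new File("morphline.conf"), "morphlineNewsFeed", context, null);

    // Attach the raw bytes; readAvro (the first command) consumes this field.
    Record record = new Record();
    record.put(Fields.ATTACHMENT_BODY, new FileInputStream(args[0]));

    Notifications.notifyStartSession(morphline);
    boolean success = morphline.process(record); // runs the whole command chain
    System.out.println(success ? "record processed" : "record failed");
  }
}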
6. News Feed ETL Data Flow
1) Ingest using Flume
2) Index using MapReduce and Morphline
[Diagram: BBC RSS news feeds (us, uk, asia, etc.) enter the Flume Agent ("agent") through the Custom Source, pass through the Memory Channel as Avro JSON record(s), and are written by the HDFS Sink into HDFS ("newsfeeds/"). The MapReduceIndexerTool (org.apache.solr.hadoop.MapReduceIndexerTool) then runs MapReduce, driven by the Morphline configuration file, over the Avro JSON records and pushes the index to Solr Cloud ("news_feeds").]
• Implemented a custom Flume event-driven Source to pull RSS news feeds from BBC World News (a hedged sketch follows at the end of this slide). Details:
  – Must implement Flume's EventDrivenSource interface
  – Parsed the news feed items
  – Wrote each item to the Channel in Avro JSON format
• Ensured the Agent was defined. CDH4 came with an agent called "tier1"; I created an agent called "agent".
• Configured the data flow in a Flume agent configuration file.
• Wrote a script that runs the Flume agent with the agent configuration file.
On the Solr side (Solr Cloud, "news_feeds"):
• Created a Solr instance for the "news_feeds" collection with a modified schema for the fields in the news feed data.
• Created the "news_feeds" collection with 1 shard.
• Wrote the Morphline file.
• Wrote a script that runs the MapReduceIndexerTool with the Morphline specification file.
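A hedged skeleton of the custom event-driven source described above. The class name, the "feedUrl" property, the 15-minute polling interval, and the fetchFeedItemsAsAvroJson helper are illustrative stand-ins for the author's actual parsing code; the Flume interfaces and calls are the real Flume 1.x API.

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.flume.Context;
import org.apache.flume.EventDrivenSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;

public class RssFeedSource extends AbstractSource
    implements EventDrivenSource, Configurable {

  private String feedUrl;
  private ScheduledExecutorService executor;

  @Override
  public void configure(Context context) {
    // e.g. agent.sources.rss.feedUrl = http://feeds.bbci.co.uk/news/world/rss.xml
    feedUrl = context.getString("feedUrl");
  }

  @Override
  public synchronized void start() {
    super.start();
    executor = Executors.newSingleThreadScheduledExecutor();
    executor.scheduleAtFixedRate(new Runnable() {
      public void run() {
        // Write each parsed feed item to the Channel as an Avro JSON event.
        for (String json : fetchFeedItemsAsAvroJson(feedUrl)) {
          getChannelProcessor().processEvent(
              EventBuilder.withBody(json.getBytes(StandardCharsets.UTF_8)));
        }
      }
    }, 0, 15, TimeUnit.MINUTES);
  }

  @Override
  public synchronized void stop() {
    executor.shutdown();
    super.stop();
  }

  // Stand-in: fetch the feed, parse the <item> elements, and serialize
  // each one as an Avro JSON record matching NewsRecord.avsc.
  private List<String> fetchFeedItemsAsAvroJson(String url) {
    throw new UnsupportedOperationException("feed parsing omitted");
  }
}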
10. News Feed Data Indexing Details - 1 of 3 (MapReduce)
[Diagram: the MapReduceIndexerTool (org.apache.solr.hadoop.MapReduceIndexerTool) reads Avro JSON record(s) from HDFS ("newsfeeds/"), runs MapReduce driven by the Morphline configuration file, and pushes the resulting index to Solr Cloud ("news_feeds").]
Two tools being used:
1. HdfsFindTool: used to get the most recent files changed within the past day.
2. MapReduceIndexerTool: runs a MapReduce job to index the HDFS input files and push the index to Solr.
# Go-live merges the output shards of the previous phase into a
# set of on-line Solr servers.
#
echo "Running go-live mode"
hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.HdfsFindTool \
  -find hdfs:///${HDFS_INDIR} \
  -type f \
  -name 'in*' \
  -mtime -1 | \
sudo -u hdfs hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 \
  jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --libjars /usr/lib/solr/contrib/mr/search-mr-0.9.1-cdh4.3.0-SNAPSHOT.jar \
  -D 'mapred.child.java.opts=-Xmx500m' \
  --log4j ${LOG_FILE} ${DRYRUN} \
  --morphline-file ${MORPHLINE_FILE} \
  --update-conflict-resolver org.apache.solr.hadoop.dedup.RetainMostRecentUpdateConflictResolver \
  ${REDUCERS_ARG} \
  --verbose \
  --output-dir hdfs://localhost:8020/${HDFS_SOLR_IDXDIR} \
  --go-live \
  --zk-host localhost:2181/solr \
  --collection ${COLLECTION} \
  --input-list -

echo "Clean-up tmp directory"
sudo -u hdfs rm /tmp/solr*.zip
echo "Done."
11. News Feed Data Indexing Details - 2 of 3 (Morphline)
The morphline pipes each record through the command chain "readAvro" → "extractAvroPaths" → "convertTimestamp" → "sanitizeUnknownSolrFields" → "loadSolr", and the resulting document is loaded into Solr Cloud ("news_feeds"). Assembled from the slide, the configuration file is:

SOLR_LOCATOR : {
  # specify collection and zkHost
}

morphlines : [
  {
    id : morphlineNewsFeed
    importCommands : ["com.cloudera.**"]
    commands : [
      { readAvro {
        isJson : true
        writerSchemaFile : /home/dataingest/schema/NewsRecord.avsc
      }}

      { extractAvroPaths {
        flatten : false
        paths : {
          id : /id
          title : /Title
          url : /Link
          published_date : /Publish_Date
          author : /Author
          comments : /Comments
          description : /Description
        }
      }}

      # Convert published_date to native Solr timestamp format
      { convertTimestamp {
        field : published_date
        inputFormats : ["EEE, d MMM yyyy HH:mm:ss z",
                        "EEE, dd MMM yyyy HH:mm:ss z"]
        inputTimezone : GMT
        outputTimezone : US/Eastern
      }}

      # Solr will throw an exception on any attempt to load
      # a document containing a field not specified in schema.xml.
      { sanitizeUnknownSolrFields {
        # Location from which to fetch the Solr schema
        solrLocator : ${SOLR_LOCATOR}
      }}

      { loadSolr {
        solrLocator : ${SOLR_LOCATOR}
      }}
    ]
  }
]

Tid-bits
• HOCON (Human-Optimized Config Object Notation) is the JSON-like format in which morphlines are written.
• A morphline is defined as a tree of commands; the output of one command is sent to the next command.
• The morphline is compiled at run-time.
12. News Feed Data Indexing Details - 3 of 3 (Solr Collection)
• The “news_feeds” Solr collection presently contains 3800 documents
in the index.
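To sanity-check the loaded collection, here is a minimal SolrJ 4.x sketch, assuming the same localhost ZooKeeper address used with --zk-host earlier; the match-all query and row count are illustrative.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class NewsFeedsQuery {
  public static void main(String[] args) throws Exception {
    // Connect to SolrCloud through ZooKeeper and target the collection.
    CloudSolrServer server = new CloudSolrServer("localhost:2181/solr");
    server.setDefaultCollection("news_feeds");

    SolrQuery query = new SolrQuery("*:*"); // match all documents
    query.setRows(5);                       // fetch a small sample
    QueryResponse response = server.query(query);

    System.out.println("Documents in index: "
        + response.getResults().getNumFound());
    server.shutdown();
  }
}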
14. Summary
• Presented ingest of RSS News Feeds using
– Flume with a Custom Source, Memory Channel,
and HDFS Sink
• Presented indexing of the news feed data in HDFS using
– Cloudera’s Morphlines library and “morphline”
configuration
– Cloudera’s MapReduceIndexerTool
– Cloudera Search with Solr 4.X