Cloudera Morphlines is a new, embeddable, open source Java framework that reduces the time and skills necessary to integrate and build Hadoop applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, analytic online dashboards, or other consumers. If you want to integrate, build, or facilitate streaming or batch transformation pipelines without programming and without MapReduce skills, and get the job done with a minimum amount of fuss and support costs, Morphlines is for you.
In this talk, you'll get an overview of Morphlines internals and explore sample use cases that can be widely applied.
Large Scale ETL for Hadoop and Cloudera Search using Morphlines
1. Large Scale ETL for Hadoop and Cloudera Search using Morphlines
Wolfgang Hoschek (@whoschek)
Silicon Valley Java User Group Meetup, Sept 2013
2. Agenda
• Hadoop, ETL and Search – setting the stage
• Cloudera Morphlines Architecture
• Component Deep Dive
• Cloudera Search Use Cases
• What's next?
Feel free to ask questions as we go!
3. Example ETL Use Case: Distributed Search on Hadoop
[Diagram: Flume, MapReduce (MR), and HBase pipelines index (ETL) data from a Hadoop cluster (MR, HDFS, HBase) into SolrCloud; Hue UI, custom UIs, and custom apps query the Solr servers.]
4. Cloudera Morphlines Architecture
[Diagram: logs, tweets, social media, HTML, images, PDF, text... anything you want to index flows through the Morphline Library, embedded in Flume, the MR Indexer, the HBase Indexer, etc. – or your application! – and into SolrCloud. Morphlines can be embedded in any application.]
5. Cloudera Morphlines
• Open source framework for simple ETL
• Consume any kind of data from any kind of data source, process it, and load it into any app or storage system
• Designed for near-real-time apps & batch apps
• Ships as part of the Cloudera Developer Kit (CDK) and Cloudera Search
• It's a Java library
• ASL licensed on GitHub: https://github.com/cloudera/cdk
• Similar to Unix pipelines, but more convenient & efficient
• Configuration over coding (reduces time & skills)
• Supports common file formats:
  • Log files & text
  • Avro, sequence files
  • JSON, HTML & XML
  • Etc. (pluggable)
• Extensible set of transformation commands
6. Extraction, Transformation and Loading
• Chain of pipelined commands
• Simple and flexible data mapping & transformation
• Reusable across multiple index workloads
• Over time, extend and re-use across platform workloads
[Diagram: a syslog event enters a Flume agent; the Solr sink runs the Morphline Library, whose commands readLine, grok, and loadSolr transform the event record step by step into a Solr document.]
7. Like a Unix Pipeline
• Like Unix pipelines, where the data model is generalized to work with streams of generic records, including arbitrary binary payloads
• Designed to be embedded into Hadoop components such as Search, Flume, MapReduce, Pig, Hive, Sqoop
8. Stdlib + Plugins
• The framework ships with a set of frequently used high-level transformation and I/O commands that can be combined in application-specific ways
• The plugin system allows adding new transformation and I/O commands, and integrates existing functionality and third-party systems in a straightforward manner
9. Flexible Data Model
• A record is a set of named fields where each field has an ordered list of one or more Java objects (i.e. Guava's ArrayListMultimap)
• A field can have multiple values, and any two records need not use common field names
• Corresponds exactly to the Solr/Lucene data model
• Passes not only structured data, but also arbitrary binary data
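To make the data model concrete, here is a minimal self-contained sketch of such a record in plain Java. The real implementation wraps Guava's ArrayListMultimap; the class and method names below are illustrative stand-ins, not the Morphlines API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the morphline record data model: named fields, each
// holding an ordered list of one or more Java objects. Two records
// need not share any field names.
public class RecordSketch {
    private final Map<String, List<Object>> fields = new LinkedHashMap<>();

    // Returns the (possibly empty) ordered list of values for a field.
    public List<Object> get(String name) {
        return fields.computeIfAbsent(name, k -> new ArrayList<>());
    }

    // Appends a value; fields may hold multiple values.
    public void put(String name, Object value) {
        get(name).add(value);
    }

    public static void main(String[] args) {
        RecordSketch rec = new RecordSketch();
        rec.put("message", "hello");
        rec.put("message", "world");   // multi-valued field
        rec.put("syslog_pri", 164);    // values can be any Java object
        System.out.println(rec.get("message").size());    // 2
        System.out.println(rec.get("syslog_pri").get(0)); // 164
    }
}
```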
10. Passing Binary Data
• _attachment_body field (optional)
  • java.io.InputStream or Java byte[]
• Optional fields assist with detecting & parsing the data type:
  • _attachment_mimetype field – e.g. "application/pdf"
  • _attachment_charset field – e.g. "UTF-8"
  • _attachment_name field – e.g. "cars.pdf"
• Conceptually similar to email and HTTP headers/body
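A hedged sketch of how a host application might populate these attachment fields before handing a record to a morphline. A plain Map stands in for the Record class; the field names follow the conventions above, and the payload values are made up:

```java
import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative: wrap a binary payload plus detection hints in the
// conventional _attachment_* fields, mirroring email/HTTP headers+body.
public class AttachmentExample {
    public static Map<String, Object> makeRecord(byte[] payload) {
        Map<String, Object> record = new LinkedHashMap<>();
        // The body itself: an InputStream (a byte[] would also work).
        record.put("_attachment_body", new ByteArrayInputStream(payload));
        // Optional hints that help downstream parsers auto-detect the type.
        record.put("_attachment_mimetype", "application/pdf");
        record.put("_attachment_charset", "UTF-8");
        record.put("_attachment_name", "cars.pdf");
        return record;
    }

    public static void main(String[] args) {
        Map<String, Object> rec = makeRecord("dummy payload".getBytes());
        System.out.println(rec.get("_attachment_name"));
    }
}
```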
11. Processing Model
• Morphline commands manipulate continuous or arbitrarily large streams of records
• A command transforms a record into zero or more records
• The output records of a command are passed to the next command in the chain
• A command can contain nested commands
• A morphline is a tree of commands, essentially a push-based data flow engine
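The push-based chain can be sketched in a few lines of plain Java: each command transforms a record into zero or more records and pushes them to the next command via a direct method call. The interface and classes below are illustrative stand-ins, not the real Morphlines API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PipelineSketch {
    // A command consumes one record and pushes zero or more records downstream.
    interface Command {
        boolean process(Map<String, Object> record);
    }

    // Splits a multi-line body into one record per line (in the spirit of readLine),
    // pushing each result to the next command with a plain method call.
    static class SplitLines implements Command {
        final Command next;
        SplitLines(Command next) { this.next = next; }
        public boolean process(Map<String, Object> record) {
            for (String line : record.get("body").toString().split("\n")) {
                Map<String, Object> out = new LinkedHashMap<>();
                out.put("message", line);
                if (!next.process(out)) return false; // downstream failure stops the chain
            }
            return true;
        }
    }

    // Terminal command: collects records, standing in for a loader like loadSolr.
    static class Collect implements Command {
        final List<Map<String, Object>> sink = new ArrayList<>();
        public boolean process(Map<String, Object> record) {
            return sink.add(record);
        }
    }

    public static void main(String[] args) {
        Collect sink = new Collect();
        Command morphline = new SplitLines(sink); // the chain: SplitLines -> Collect
        Map<String, Object> in = new LinkedHashMap<>();
        in.put("body", "line1\nline2\nline3");
        morphline.process(in);
        System.out.println(sink.sink.size()); // 3: one record per input line
    }
}
```

Note how one input record fans out into three output records, all within the current thread: no queues, no handoffs, just method calls.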
12. Processing Model Non-Goals
• Designed to be embedded into multiple host systems, thus...
• No notion of persistence, durability, distributed computing, or node failover
• Basically just a chain of in-memory transformations in the current thread
• No need to manage multiple nodes or threads – already covered by host systems such as MapReduce, Flume, Storm, Samza, etc.
• However, a morphline does support passing notifications
  • E.g. BEGIN_TRANSACTION, COMMIT_TRANSACTION, ROLLBACK_TRANSACTION, SHUTDOWN
13. Performance and Scaling
• The runtime compiles morphlines on the fly
• The runtime processes all commands of a given morphline in the same thread
• Piping a record from one command to another is fast:
  • Just a cheap Java method call
  • No queues, no handoffs among threads, no context switches, and no serialization between commands
• For scalability, deploy many morphline instances on a cluster in many Flume agents and MapReduce tasks
14. Syntax
• HOCON format (Human-Optimized Config Object Notation)
• Basically JSON, slightly adjusted for the configuration file use case
• Came out of typesafe.com
• Also used by the Akka and Play frameworks
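As a sketch of what the format looks like, here is a minimal morphline configuration in HOCON, following the structure of the syslog example shown later in this deck (note the JSON-like nesting with unquoted keys and no commas between list elements):

```hocon
morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      { loadSolr {} }
    ]
  }
]
```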
15. Example: Indexing log4j with Stack Traces
juil. 25, 2012 10:49:40 AM hudson.triggers.SafeTimerTask run ok
juil. 25, 2012 10:49:46 AM hudson.triggers.SafeTimerTask run failed
com.amazonaws.AmazonClientException: Unable to calculate a request signature
at com.amazonaws.auth.AbstractAWSSigner.signAndBase64Encode(AbstractAWSSigner.java:71)
at java.util.TimerThread.run(Timer.java:505)
Caused by: com.amazonaws.AmazonClientException: Unable to calculate a request signature
at com.amazonaws.auth.AbstractAWSSigner.sign(AbstractAWSSigner.java:90)
at com.amazonaws.auth.AbstractAWSSigner.signAndBase64Encode(AbstractAWSSigner.java:68)
... 14 more
Caused by: java.lang.IllegalArgumentException: Empty key
at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:96)
at com.amazonaws.auth.AbstractAWSSigner.sign(AbstractAWSSigner.java:87)
... 15 more
juil. 25, 2012 10:49:54 AM hudson.slaves.SlaveComputer tryReconnect
[The excerpt splits into three records: the initial "ok" line (Record 1), the multi-line "failed" entry with its stack trace (Record 2), and the final tryReconnect line (Record 3).]
18. Current Command Library
• Integrate with and load into Apache Solr
• Flexible log file analysis:
  • Single-line records, multi-line records, CSV files
  • Regex-based pattern matching and extraction
• Integration with Avro, JSON, XML, HTML
• Integration with Apache Hadoop sequence files
• Integration with SolrCell and all Apache Tika parsers
• Auto-detection of MIME types from binary data using Apache Tika
19. Current Command Library (cont'd)
• Scripting support for dynamic Java code
• Operations on fields for assignment and comparison
• Operations on fields with list and set semantics
• if-then-else conditionals
• A small rules engine (tryRules)
• String and timestamp conversions
• slf4j logging
• Yammer metrics and counters
• Decompression and unpacking of arbitrarily nested container file formats
• etc.
20. Plugin Commands
• Easy to add new I/O & transformation commands
• Integrate existing functionality and third-party systems
• Implement the Java interface Command or subclass AbstractCommand
• Add it to the Java classpath
• No registration or other administrative action required
21. Morphline Example – syslog with grok

morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]

Example input:
<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22.

Output record:
syslog_pri: 164
syslog_timestamp: Feb 4 10:46:14
syslog_hostname: syslog
syslog_program: sshd
syslog_pid: 607
syslog_message: listening on 0.0.0.0 port 22.
22. Example Java Driver Program

/** Usage: java ... <morphline.conf> <dataFile1> ... <dataFileN> */
public static void main(String[] args) throws IOException {
  // compile the morphline.conf file on the fly
  File conf = new File(args[0]);
  MorphlineContext ctx = new MorphlineContext.Builder().build();
  Command morphline = new Compiler().compile(conf, null, ctx, null);

  // process each input data file
  Notifications.notifyBeginTransaction(morphline);
  for (int i = 1; i < args.length; i++) {
    InputStream in = new FileInputStream(new File(args[i]));
    Record record = new Record();
    record.put(Fields.ATTACHMENT_BODY, in);
    morphline.process(record);
    in.close();
  }
  Notifications.notifyCommitTransaction(morphline);
}
23. Potential New Plugin Commands
• Extract, clean, transform, join, integrate, enrich and decorate records
• Examples:
  • Join records with external data sources such as relational databases, key-value stores, local files or IP geo lookup tables
  • Perform DNS resolution, expand shortened URLs
  • Fetch linked metadata from social networks
  • Do sentiment analysis & annotate the record accordingly
  • Continuously maintain stats over sliding windows
  • Compute exact or approximate distinct values & quantiles
24. Example Command Implementation (1/2)
public final class ToStringBuilder implements CommandBuilder {
@Override
public Collection<String> getNames() {
return Collections.singletonList("toString");
}
@Override
public Command build(Config config, Command parent, Command child,
MorphlineContext context) {
return new ToString(config, parent, child, context);
}
private static final class ToString extends AbstractCommand {
@Override
protected boolean doProcess(Record record) {
// some custom processing goes here
return super.doProcess(record); // pass to next command in chain
}
}
}
25. Example Command Implementation (2/2)
private static final class ToString extends AbstractCommand {
private final String fieldName;
private final boolean trim;
public ToString(Config config, Command parent, Command child,
MorphlineContext context) {
super(config, parent, child, context);
this.fieldName = getConfigs().getString(config, "field");
this.trim = getConfigs().getBoolean(config, "trim", false);
validateArguments();
}
@Override
protected boolean doProcess(Record record) {
ListIterator iter = record.get(fieldName).listIterator();
while (iter.hasNext()) {
String str = iter.next().toString();
iter.set(trim ? str.trim() : str);
}
return super.doProcess(record); // pass to next command in chain
}
}
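Given the getConfigs().getString(config, "field") and getConfigs().getBoolean(config, "trim", false) calls in the constructor above, the command could be invoked from a morphline config file roughly like this (a hypothetical usage sketch; the field name message is illustrative):

```hocon
{ toString { field : message, trim : true } }
```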
26. Use Case: Cloudera Search – An Integrated Part of the Hadoop System
• One pool of data
• One security framework
• One set of system resources
• One management interface
27. What is Cloudera Search?
• Full-text, interactive search and faceted navigation
• Batch, near-real-time, and on-demand indexing
• Apache Solr integrated with CDH:
  • Established, mature search with a vibrant community
  • Separate runtime, like MapReduce and Impala
  • Incorporated as part of the Hadoop ecosystem
• Open source:
  • 100% Apache, 100% Solr
  • Standard Solr APIs
28. ETL for Distributed Search on Apache Hadoop
[Diagram: same architecture as slide 3 – Flume, MR, and HBase indexing (ETL) pipelines on the Hadoop cluster (MR, HDFS, HBase) feed SolrCloud, which serves queries from Hue UI, custom UIs, and custom apps.]
29. Near Real Time ETL & Indexing with Flume
Apache Solr and Apache Flume:
• Data ingest at scale
• Flexible extraction and mapping
• Indexing at data ingest
• Packaged as the Flume Morphline Solr Sink
[Diagram: log files flow into Flume agents, each running an indexer with a morphline, and on to Solr and HDFS.]

flume.conf:
agent.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.solrSink.morphlineFile = /etc/flume-ng/conf/morphline.conf
33. Near Real Time Indexing of Apache HBase
[Diagram: interactive load into HBase (backed by HDFS); Lily HBase Indexer(s) with morphlines trigger on updates and index into a cluster of Solr servers.]
HBase (large-scale tabular data, immediate access & updates) + Search (fast & flexible information discovery) = big data management
34. Batch & Near Real Time ETL
[Diagram: tweets, log formats, social media, HTML, images, and PDF flow into Flume, whose MorphlineSink indexes into Solr and whose HdfsSink writes to HDFS; the MapReduceIndexerTool and the Lily HBase Indexer (on HBase, OLTP) each run morphlines to index into Solr; Hue UI, custom UIs, and custom apps query Solr; HDFS data also feeds the MapReduceIndexerTool, Impala, HBase, Mahout, EDW, MR, etc.]
35. What's Next
• More work on Apache HBase integration
• Integration into Apache Hive & Sqoop
• Stream analytics
36. Conclusion
• Cloudera Development Kit w/ Morphlines:
  • Open source – ASL license
  • Version 0.7.0 shipping
  • Extensive documentation
  • Send your questions and feedback to the cdk-dev mailing list
• Also ships integrated with Cloudera Search
• Free QuickStart VM also available!