Supreet Oberoi's presentation on "Large scale log processing with Cascading & Elastic Search". Elasticsearch is becoming a popular platform for log analysis through its ELK stack: Elasticsearch for search, Logstash for centralized logging, and Kibana for visualization. With the addition of Cascading, the application development platform for building data applications on Apache Hadoop, developers can correlate multiple log and data streams at scale and perform rich, complex log processing before making the results available to the ELK stack.
Elasticsearch + Cascading for Scalable Log Processing
1. DRIVING INNOVATION THROUGH DATA
LARGE-SCALE LOG PROCESSING WITH CASCADING & LOGSTASH
Elasticsearch Meetup, Oct 30 2014
2. WHAT IS LOG FILE ANALYTICS?
• Making sense of large amounts of [semi|un]structured data
• What type of log file data?
‣ Syslog
‣ Web log files (Apache, Nginx, WebTrends, Omniture)
‣ POS transactions
‣ Advertising impressions (Doubleclick DART, OpenX, Atlas)
‣ Twitter firehose (yes, it’s a log file!)
• Anything with a timestamp and data
3. LOGSTASH ARCHITECTURE
http://www.slashroot.in/logstash-tutorial-linux-central-logging-server
• Data collection is flexible
• Lots of input/output plugins
• Grok filtering is easy
• Kibana UI is attractive
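To make the "Grok filtering is easy" point concrete, here is a minimal sketch of a Logstash pipeline config (the path and output are illustrative, not from the deck) that tails an Apache access log, structures each line with the stock COMBINEDAPACHELOG Grok pattern, and prints the parsed event; a real deployment would typically use an elasticsearch output instead:

input {
  file { path => "/var/log/apache2/access.log" }
}
filter {
  grok {
    # COMBINEDAPACHELOG is a stock pattern shipped with Logstash
    match => [ "message", "%{COMBINEDAPACHELOG}" ]
  }
}
output {
  stdout { codec => rubydebug }
}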
4. WHAT CAN WE DO WITH CASCADING + LOGSTASH?
• Provide richer log-processing capabilities
• Integrate & correlate with other information
‣ Large list of integration adapters
• Analyze large volumes of log data
• Capture & retain unfiltered log data
• Operationalize your log-processing application
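One concrete way to wire the two sides together (a sketch, not from the deck): the elasticsearch-hadoop project ships a Cascading integration whose EsTap can serve as the sink of a flow, so processed tuples land directly in an Elasticsearch index. The index/type "logs/entry" and the field names below are illustrative:

import cascading.tap.Tap;
import cascading.tuple.Fields;
import org.elasticsearch.hadoop.cascading.EsTap;

// EsTap writes each tuple as a document into the given index/type;
// attach it as a flow sink in place of an Hfs tap, e.g.
// flowDef.addTailSink( logPipe, esSink );
Tap esSink = new EsTap( "logs/entry", new Fields( "timestamp", "level", "message" ) );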
5. GET TO KNOW CONCURRENT
Leader in Application Infrastructure for Big Data
• Building enterprise software to simplify Big Data application
development and management
Products and Technology
• CASCADING
Open Source - The most widely used application infrastructure for
building Big Data apps with over 175,000 downloads each month
• DRIVEN
Enterprise data application management for Big Data apps
Proven — Simple, Reliable, Robust
• Thousands of enterprises rely on Concurrent to provide their data
application infrastructure.
Founded: 2008
HQ: San Francisco, CA
CEO: Gary Nakamura
CTO, Founder: Chris Wensel
www.concurrentinc.com
6. CASCADING - DE FACTO STANDARD FOR DATA APPS
[Diagram: Cascading apps (SQL, Clojure, Ruby) running on new fabrics (Tez, Storm) and supported fabrics and data stores (mainframe, DB/DW, in-memory data stores, Hadoop)]
• Standard for enterprise data app development
• Your programming language of choice
• Cascading applications that run on MapReduce will also run on Apache Spark, Storm, and …
7. CASCADING 3.0
“Write once and deploy on your fabric of choice.”
• The Innovation — Cascading 3.0 will allow data apps to execute on existing and emerging fabrics through its new customizable query planner.
• Cascading 3.0 will support — Local In-Memory and Apache MapReduce, and soon thereafter (3.1) Apache Tez, Apache Spark, and Apache Storm
[Diagram: enterprise data applications running on computation fabrics: Local In-Memory, MapReduce, and Apache Tez/Storm]
8. … AND INCLUDES RICH SET OF EXTENSIONS
http://www.cascading.org/extensions/
9. DEMO: WORD COUNT EXAMPLE WITH CASCADING
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

// configuration
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// integration: create source and sink taps (tab-delimited text on HDFS)
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// processing: specify a regex to split "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// scheduling: connect the taps, pipes, etc., into a flow definition
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );
// create the Flow
Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
wcFlow.complete(); // <<-- Runs jobs on Cluster
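Assuming the code above is packaged as class Main in a jar together with its Cascading dependencies (the jar name and paths here are illustrative), the flow could be launched against a cluster with something like:

hadoop jar wordcount.jar data/docs.tsv output/wc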
10. SOME COMMON PATTERNS
• Functions
• Filters
• Joins
‣ Inner / Outer / Mixed
‣ Asymmetrical / Symmetrical
• Merge (Union)
• Grouping
‣ Secondary Sorting
‣ Unique (Distinct)
• Aggregations
‣ Count, Average, etc.
[Diagram: pipe assemblies: a linear pipeline of functions and filters over data, and a topology with split, join, and merge]
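A minimal sketch (not from the deck; field names and the regex are illustrative) of two of the patterns above in Cascading's Java API: a filter that keeps only 5xx responses, and an inner CoGroup join of the log stream against a lookup stream:

import cascading.operation.regex.RegexFilter;
import cascading.pipe.CoGroup;
import cascading.pipe.Each;
import cascading.pipe.Pipe;
import cascading.pipe.joiner.InnerJoin;
import cascading.tuple.Fields;

// filter: keep only tuples whose "status" field matches a 5xx code
Pipe logs = new Pipe( "logs" );
logs = new Each( logs, new Fields( "status" ), new RegexFilter( "5\\d\\d" ) );

// join: inner join the filtered logs with a geo lookup stream;
// the lookup key is named "addr" so field names stay unique in the result
Pipe geo = new Pipe( "geo" );
Pipe joined = new CoGroup( logs, new Fields( "ip" ), geo, new Fields( "addr" ), new InnerJoin() );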
11. THE STANDARD FOR DATA APPLICATION DEVELOPMENT
www.cascading.org
Proven application development framework for building data apps. The application platform addresses:
• Build data apps that are scale-free: design principles ensure best practices at any scale
• Test-Driven Development: efficiently test code and process local files before deploying on a cluster
• Staffing bottleneck: use existing Java, SQL, and modeling skill sets
• Application portability: write once, then run on different computation fabrics (see the sketch below)
• Operational complexity: simple; package up into one jar and hand to operations
• Systems integration: Hadoop never lives alone; easily integrate with existing systems
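A minimal sketch of the application-portability point, assuming Cascading 2.x class names: the same FlowDef can be handed to different flow connectors, so a local in-memory test run and a MapReduce run on the cluster differ only in the connector (taps must also match the platform, e.g. FileTap locally vs. Hfs on Hadoop):

import java.util.Properties;

import cascading.flow.FlowConnector;
import cascading.flow.hadoop.HadoopFlowConnector; // plans the flow onto MapReduce
import cascading.flow.local.LocalFlowConnector;   // plans the flow onto local in-memory

// choose the fabric at launch time; the flow definition itself is unchanged
FlowConnector connector = runLocally
    ? new LocalFlowConnector( new Properties() )
    : new HadoopFlowConnector( new Properties() );
connector.connect( flowDef ).complete();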
12. CASCADING
• Java API
• Separates business logic from integration
• Testable at every lifecycle stage
• Works with any JVM language
• Many integration adapters
[Diagram: Cascading architecture: Processing API and Integration API over a Process Planner and Scheduler API, running on Apache Hadoop and data stores; scriptable from Scala, Clojure, JRuby, Jython, and Groovy alongside enterprise Java]
13. BUSINESSES DEPEND ON US
• Cascading Java API
• Data normalization and cleansing of search and click-through logs for
use by analytics tools, Hive analysts
• Easy to operationalize heavy lifting of data in one framework
16. OPERATIONAL EXCELLENCE
Visibility Through All Stages of App Lifecycle
From Development — Building and Testing
• Design & Development
• Debugging
• Tuning
To Production — Monitoring and Tracking
• Maintain Business SLAs
• Balance & Controls
• Application and Data Quality
• Operational Health
• Real-time Insights
18. DEEPER VISUALIZATION INTO YOUR HADOOP CODE
• Easily comprehend, debug, and tune
your data applications
• Get rich insights on your application
performance
• Monitor applications in real-time
• Compare app performance with
historical (previous) iterations
Debug and optimize your Hadoop applications more effectively with Driven
19. GET OPERATIONAL INSIGHTS WITH DRIVEN
• Quickly break down how often applications execute based on their tags, teams, or names
• Immediately identify if any application is
monopolizing cluster resources
• Understand the utilization of your cluster
with a timeline of all applications running
Visualize the activity of your applications to help maintain SLAs
20. ORGANIZE YOUR APPLICATIONS WITH GREATER FIDELITY
• Easily keep track of all your
applications by segmenting them with
user-defined tags
• Segment your applications for
trending analysis, cluster analysis,
and developing chargeback models
• Quickly break down how often applications execute based on their tags, teams, or names
Segment your applications for greater insights across all your applications
21. COLLABORATE WITH TEAMS
Utilize teams to collaborate and gain visibility over your set of applications
• Invite others to view and collaborate
on a specific application
• Gain visibility to all the apps and their
owners associated with each team
• Simply manage your teams and the
users assigned to them
22. MANAGE PORTFOLIO OF BIG DATA APPLICATIONS
Fast, powerful, rich search capabilities enable you to easily find the exact set of applications that you’re looking for
• Identify problematic apps with their owners and teams
• Search for groups of applications segmented by user-defined tags
• Compare specific applications with their previous iterations to ensure that your application can meet its SLA
23. DRIVEN FOR HIVE: OPERATIONAL VISIBILITY FOR YOUR HIVE APPS
• Understand the anatomy of your Hive app
• Track execution of queries as a single business process
• Identify outlier behavior by comparison with historical runs
• Analyze rich operational metadata
• Correlate Hive app behavior with other events on cluster
24. TAKEAWAY POINTS
• Logstash provides a flexible and robust way to collect log data; Grok lets you parse logs without coding, and the Kibana UI is attractive for analyzing the information
• Cascading is the de facto framework for building Big Data (Hadoop) applications and processing data at scale
• Cascading + Logstash lets you develop applications to collect and process large volumes of data
• With Driven, you can put your mission-critical log-processing applications in production and monitor SLAs