DRIVING INNOVATION 
THROUGH DATA 
LARGE-SCALE LOG PROCESSING WITH CASCADING & LOGSTASH 
Elasticsearch Meetup, Oct 30 2014
WHAT IS LOG FILE ANALYTICS? 
• Making sense of large amounts of [semi|un]structured data 
• What type of log file data? 
‣ Syslog 
‣ Web log files (Apache, Nginx, WebTrends, Omniture) 
‣ POS transactions 
‣ Advertising impressions (Doubleclick DART, OpenX, Atlas) 
‣ Twitter firehose (yes, it’s a log file!) 
• Anything with a timestamp and data 
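To make "anything with a timestamp and data" concrete, here is a minimal sketch in plain Java (JDK only, not Logstash or Cascading) that splits one web log line in Apache Common Log Format into a timestamp plus data fields. The class name `LogLineParser`, the `Event` fields, and the sample line are illustrative assumptions, not part of the talk.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogLineParser {
    // Apache Common Log Format: host ident user [timestamp] "request" status bytes
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\d+|-)$");

    // Timestamp layout used in such logs, e.g. 30/Oct/2014:18:30:00 -0700
    private static final DateTimeFormatter TS =
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    /** One parsed record: a timestamp plus the data that came with it. */
    static final class Event {
        final OffsetDateTime timestamp;
        final String host;
        final String request;
        final int status;

        Event(OffsetDateTime timestamp, String host, String request, int status) {
            this.timestamp = timestamp;
            this.host = host;
            this.request = request;
            this.status = status;
        }
    }

    /** Returns the parsed event, or empty when the line does not have the expected shape. */
    static Optional<Event> parse(String line) {
        Matcher m = CLF.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new Event(
            OffsetDateTime.parse(m.group(2), TS),
            m.group(1),
            m.group(3),
            Integer.parseInt(m.group(4))));
    }

    public static void main(String[] args) {
        // Hypothetical sample line, for illustration only.
        String line = "192.0.2.10 - - [30/Oct/2014:18:30:00 -0700] "
            + "\"GET /index.html HTTP/1.1\" 200 2326";
        parse(line).ifPresent(e ->
            System.out.println(e.timestamp + " " + e.host + " " + e.status));
    }
}
```

Lines that do not match are returned as empty rather than dropped silently, so a caller can still route them to a raw, unfiltered store.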
LOGSTASH ARCHITECTURE 
http://www.slashroot.in/logstash-tutorial-linux-central-logging-server
• Data collection is flexible 
• Lots of input/output plugins 
• Grok filtering is easy 
• Kibana UI is attractive
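Grok expressions name a reusable pattern and the field to store the match in, using the `%{SYNTAX:field}` form. The sketch below imitates that idea in plain Java by expanding such references into named capturing groups. It is not Logstash's implementation: the class `MiniGrok`, its four-entry pattern library, and the sample line are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MiniGrok {
    // A tiny, hypothetical pattern library; real Grok ships far more patterns.
    private static final Map<String, String> LIBRARY = new HashMap<>();
    static {
        LIBRARY.put("IP", "\\d{1,3}(?:\\.\\d{1,3}){3}");
        LIBRARY.put("WORD", "\\w+");
        LIBRARY.put("NUMBER", "\\d+");
        LIBRARY.put("URIPATH", "/\\S*");
    }

    // Finds %{SYNTAX:field} references. Field names must be letters and digits
    // only, because they become Java named groups.
    private static final Pattern REF = Pattern.compile("%\\{(\\w+):(\\w+)\\}");

    /** Expands every %{SYNTAX:field} reference into (?<field>regex). */
    static Pattern compile(String expression) {
        Matcher m = REF.matcher(expression);
        StringBuffer regex = new StringBuffer();
        while (m.find()) {
            String body = LIBRARY.get(m.group(1));
            if (body == null) {
                throw new IllegalArgumentException("Unknown pattern: " + m.group(1));
            }
            String group = "(?<" + m.group(2) + ">" + body + ")";
            m.appendReplacement(regex, Matcher.quoteReplacement(group));
        }
        m.appendTail(regex);
        return Pattern.compile(regex.toString());
    }

    /** Returns field name to captured text, or an empty map if the line does not match. */
    static Map<String, String> match(String expression, String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = compile(expression).matcher(line);
        if (!m.matches()) {
            return fields;
        }
        Matcher refs = REF.matcher(expression);
        while (refs.find()) {
            fields.put(refs.group(2), m.group(refs.group(2)));
        }
        return fields;
    }

    public static void main(String[] args) {
        String expr = "%{IP:client} %{WORD:method} %{URIPATH:path} %{NUMBER:bytes}";
        System.out.println(match(expr, "192.0.2.10 GET /index.html 2326"));
    }
}
```

The point of the design is that the person writing the expression only composes named building blocks; the regular expressions stay inside the library.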
WHAT CAN WE DO WITH CASCADING + LOGSTASH? 
• Provide richer log-processing capabilities 
• Integrate & correlate with other information 
‣ Large list of integration adapters 
• Analyze large volumes of log data 
• Capture & retain unfiltered log data 
• Operationalize your log-processing application 
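As one illustration of "integrate & correlate with other information", the sketch below enriches log events with a lookup table using a left-outer hash join. It is plain Java rather than Cascading code, and the event layout (`userId`, `url`), the region lookup, and the class name `LogEnricher` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LogEnricher {
    /**
     * Left-outer hash join: every event is kept, and events whose user is
     * missing from the lookup table are tagged "unknown".
     * Each event is a two-element array: {userId, url}.
     */
    static List<String> enrich(List<String[]> events, Map<String, String> userToRegion) {
        List<String> out = new ArrayList<>();
        for (String[] event : events) {
            String region = userToRegion.getOrDefault(event[0], "unknown");
            out.add(event[0] + "\t" + event[1] + "\t" + region);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical reference data, e.g. loaded from a customer database.
        Map<String, String> regions = new HashMap<>();
        regions.put("u1", "EMEA");

        List<String[]> events = Arrays.asList(
            new String[] {"u1", "/pricing"},
            new String[] {"u9", "/signup"});

        enrich(events, regions).forEach(System.out::println);
    }
}
```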
GET TO KNOW CONCURRENT 
Leader in Application Infrastructure for Big Data 
• Building enterprise software to simplify Big Data application 
development and management 
Products and Technology 
• CASCADING 
Open Source - The most widely used application infrastructure for 
building Big Data apps with over 175,000 downloads each month 
• DRIVEN 
Enterprise data application management for Big Data apps 
Proven — Simple, Reliable, Robust 
• Thousands of enterprises rely on Concurrent to provide their data 
application infrastructure. 
Founded: 2008 
HQ: San Francisco, CA 
CEO: Gary Nakamura 
CTO, Founder: Chris Wensel 
www.concurrentinc.com
CASCADING - DE-FACTO STANDARD FOR DATA APPS 
[Diagram: Cascading apps (SQL, Clojure, Ruby) on supported fabrics and data stores (Mainframe, DB / DW, In-Memory, Data Stores, Hadoop) and new fabrics (Tez, Storm)] 
• Standard for enterprise data app development 
• Your programming language of choice 
• Cascading applications that run on MapReduce will also run on Apache Spark, Storm, and …
CASCADING 3.0 
“Write once and deploy on your fabric of choice.” 
• The Innovation — Cascading 3.0 will 
allow for data apps to execute on 
existing and emerging fabrics 
through its new customizable query 
planner. 
• Cascading 3.0 will support — Local 
In-Memory, Apache MapReduce and 
soon thereafter (3.1) Apache Tez, 
Apache Spark and Apache Storm 
[Diagram: enterprise data applications running on computation fabrics: Local In-Memory, MapReduce, Apache Tez, Storm]
… AND INCLUDES RICH SET OF EXTENSIONS 
http://www.cascading.org/extensions/
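DEMO: WORD COUNT EXAMPLE WITH CASCADING 
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class Main {
  public static void main( String[] args ) {
    // configuration
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // integration
    // create source and sink taps (tab-delimited text with a header row)
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // processing
    // specify a regex to split "document" text lines into token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // scheduling
    // connect the taps, pipes, etc., into a flow definition
    FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );
    // create the Flow
    Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
    wcFlow.complete(); // <<-- Runs jobs on Cluster
  }
}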
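For readers without a cluster at hand, the same computation (split each line on the delimiter class, group by token, count) can be sketched in plain Java. This is not Cascading code and runs in a single JVM; the class name `WordCountSketch` and the decision to skip empty tokens are assumptions of this sketch.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

class WordCountSketch {
    // Same delimiter class as the demo: space, square brackets, parentheses, comma, period.
    private static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    /** Splits every line into tokens and counts occurrences of each token. */
    static Map<String, Long> count(Iterable<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : DELIMITERS.split(line)) {
                if (token.isEmpty()) {
                    continue; // assumption: ignore empty tokens between adjacent delimiters
                }
                counts.merge(token, 1L, Long::sum); // group by token, then count
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical input lines, for illustration only.
        System.out.println(count(Arrays.asList(
            "a rain shadow, a dry area",
            "rain (shadow)")));
    }
}
```

The escaped brackets and parentheses in the pattern matter: without the backslashes, the character class does not describe the intended delimiters.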
SOME COMMON PATTERNS 
• Functions 
• Filters 
• Joins 
‣ Inner / Outer / Mixed 
‣ Asymmetrical / Symmetrical 
• Merge (Union) 
• Grouping 
‣ Secondary Sorting 
‣ Unique (Distinct) 
• Aggregations 
‣ Count, Average, etc 
[Diagrams: a Pipeline of filters and functions applied to data; a Topology with Split, Join, and Merge over data]
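Several of the patterns above (merge, filter, function, grouping, aggregation, unique) can be illustrated with the JDK stream API. The sketch is a single-JVM analogy, not Cascading's pipe API; the class `PatternSketch` and the HTTP status-code data are assumptions chosen for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class PatternSketch {
    /** Merges two lists of HTTP status codes and counts errors per status class. */
    static Map<String, Long> errorsByClass(List<Integer> siteA, List<Integer> siteB) {
        return Stream.concat(siteA.stream(), siteB.stream())        // merge (union)
            .filter(status -> status >= 400)                         // filter: keep errors only
            .map(status -> (status / 100) + "xx")                    // function: 404 -> "4xx"
            .collect(Collectors.groupingBy(                          // grouping by status class
                statusClass -> statusClass,
                TreeMap::new,
                Collectors.counting()));                             // aggregation: count
    }

    /** Unique (distinct) values, in first-seen order. */
    static List<Integer> unique(List<Integer> statuses) {
        return statuses.stream().distinct().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical status codes from two web servers.
        List<Integer> siteA = Arrays.asList(200, 404, 500);
        List<Integer> siteB = Arrays.asList(503, 301, 404);
        System.out.println(errorsByClass(siteA, siteB));
        System.out.println(unique(Arrays.asList(200, 404, 200)));
    }
}
```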
THE STANDARD FOR DATA APPLICATION DEVELOPMENT 
www.cascading.org 
Proven application development framework for building data apps 
Application platform that addresses: 
• Build data apps that are scale-free: design principles ensure best practices at any scale 
• Test-Driven Development: efficiently test code and process local files before deploying on a cluster 
• Staffing Bottleneck: use existing Java, SQL, modeling skill sets 
• Application Portability: write once, then run on different computation fabrics 
• Operational Complexity: simple; package up into one jar and hand to operations 
• Systems Integration: Hadoop never lives alone; easily integrate to existing systems
CASCADING 
• Java API 
• Separates business logic from integration 
• Testable at every lifecycle stage 
• Works with any JVM language 
• Many integration adapters 
[Architecture diagram: Enterprise Java and scripting languages (Scala, Clojure, JRuby, Jython, Groovy) on top of Cascading (Processing API, Integration API, Process Planner, Scheduler API, Scheduler), which runs on Apache Hadoop and data stores]
BUSINESSES DEPEND ON US 
• Cascading Java API 
• Data normalization and cleansing of search and click-through logs for 
use by analytics tools, Hive analysts 
• Easy to operationalize heavy lifting of data in one framework 
BROAD SUPPORT 
Hadoop ecosystem supports Cascading
CASCADING DEPLOYMENTS 
Confidential 
OPERATIONAL EXCELLENCE 
Visibility Through All Stages of App Lifecycle 
From Development — Building and Testing 
• Design & Development 
• Debugging 
• Tuning 
To Production — Monitoring and Tracking 
• Maintain Business SLAs 
• Balance & Controls 
• Application and Data Quality 
• Operational Health 
• Real-time Insights 
DRIVEN ARCHITECTURE
DEEPER VISUALIZATION INTO YOUR HADOOP CODE 
• Easily comprehend, debug, and tune 
your data applications 
• Get rich insights on your application 
performance 
• Monitor applications in real-time 
• Compare app performance with 
historical (previous) iterations 
Debug and optimize your Hadoop applications more effectively with Driven
GET OPERATIONAL INSIGHTS WITH DRIVEN 
• Quickly break down how often 
applications execute based on their tags, 
teams, or names 
• Immediately identify if any application is 
monopolizing cluster resources 
• Understand the utilization of your cluster 
with a timeline of all applications running 
Visualize the activity of your applications to help maintain SLAs
ORGANIZE YOUR APPLICATIONS WITH GREATER FIDELITY 
• Easily keep track of all your 
applications by segmenting them with 
user-defined tags 
• Segment your applications for 
trending analysis, cluster analysis, 
and developing chargeback models 
• Quickly break down how often 
applications execute based on their 
tags, teams, or names 
Segment your applications for greater insights across all your applications
COLLABORATE WITH TEAMS 
Utilize teams to collaborate and gain visibility over your set of applications 
• Invite others to view and collaborate 
on a specific application 
• Gain visibility to all the apps and their 
owners associated with each team 
• Simply manage your teams and the 
users assigned to them 
MANAGE PORTFOLIO OF BIG DATA APPLICATIONS 
Fast, powerful, rich search capabilities enable you to easily find the exact set of applications that you’re looking for 
• Identify problematic apps with their owners and teams 
• Search for groups of applications segmented by user-defined tags 
• Compare specific applications with their previous iterations to ensure that your application can meet its SLA
DRIVEN FOR HIVE: OPERATIONAL VISIBILITY FOR YOUR HIVE APPS 
• Understand the anatomy of your Hive app 
• Track execution of queries as a single business process 
• Identify outlier behavior by comparison with historical runs 
• Analyze rich operational meta-data 
• Correlate Hive app behavior with other events on cluster 
TAKE AWAY POINTS 
• Logstash provides a flexible and robust way to collect log data; Grok lets you parse logs without coding; Kibana offers an attractive UI for analyzing the information 
• Cascading is the de-facto framework for building Big Data (Hadoop) applications and processing data at scale 
• Cascading + Logstash lets you develop applications to collect and process large volumes of data 
• With Driven, you can put your mission-critical log-processing applications in production and monitor SLAs
CONTACT INFORMATION 
Supreet Oberoi 
supreet@concurrentinc.com 
650-868-7675 (m) 
@supreet_online
DRIVING INNOVATION 
THROUGH DATA 
THANK YOU 
Supreet Oberoi

Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 

Dernier (20)

Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 

Elasticsearch + Cascading for Scalable Log Processing

  • 1. DRIVING INNOVATION THROUGH DATA LARGE-SCALE LOG PROCESSING WITH CASCADING & LOGSTASH Elasticsearch Meetup, Oct 30 2014
  • 2. WHAT IS LOG FILE ANALYTICS? • Making sense of large amounts of [semi|un]structured data • What type of log file data? ‣ Syslog ‣ Web log files (Apache, Nginx, WebTrends, Omniture) ‣ POS transactions ‣ Advertising impressions (Doubleclick DART, OpenX, Atlas) ‣ Twitter firehose (yes, it’s a log file!) • Anything with a timestamp and data
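To make "anything with a timestamp and data" concrete, here is a minimal plain-Java sketch that turns one Apache-style access log line into structured fields. It assumes the Common Log Format layout (host, ident, user, bracketed timestamp, quoted request, status, bytes); the class name `AccessLogLine` and its fields are hypothetical and are not part of Logstash or Cascading.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a Logstash or Cascading class.
final class AccessLogLine {
    // Assumed layout: host ident user [timestamp] "request" status bytes
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) \\S+ (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\d+|-)$");

    // Example timestamp: 10/Oct/2000:13:55:36 -0700
    private static final DateTimeFormatter TIMESTAMP =
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    final String host;
    final String user;
    final OffsetDateTime timestamp;
    final String request;
    final int status;
    final long bytes;

    private AccessLogLine(String host, String user, OffsetDateTime timestamp,
                          String request, int status, long bytes) {
        this.host = host;
        this.user = user;
        this.timestamp = timestamp;
        this.request = request;
        this.status = status;
        this.bytes = bytes;
    }

    // Returns an empty Optional for lines that do not match the assumed layout.
    static Optional<AccessLogLine> parse(String line) {
        Matcher m = CLF.matcher(line.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        try {
            OffsetDateTime ts = OffsetDateTime.parse(m.group(3), TIMESTAMP);
            // A "-" in the size field means no body was sent; treat it as zero bytes.
            long size = "-".equals(m.group(6)) ? 0L : Long.parseLong(m.group(6));
            return Optional.of(new AccessLogLine(
                m.group(1), m.group(2), ts, m.group(4),
                Integer.parseInt(m.group(5)), size));
        } catch (DateTimeParseException | NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String sample = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        parse(sample).ifPresent(l ->
            System.out.println(l.timestamp + " " + l.status + " " + l.request));
    }
}
```

Returning `Optional.empty()` for unparseable lines, rather than throwing, keeps a single malformed record from stopping a large batch; whether to drop or retain such lines is a design choice this sketch leaves open.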
  • 3. LOGSTASH ARCHITECTURE http://www.slashroot.in/logstash-tutorial-linux-central-logging-server/ • Data collection is flexible • Lots of input/output plugins • Grok filtering is easy • Kibana UI is attractive
  • 4. WHAT CAN WE DO WITH CASCADING + LOGSTASH? • Provide richer log-processing capabilities • Integrate & correlate with other information ‣ Large list of integration adapters • Analyze large volumes of log data • Capture & retain unfiltered log data • Operationalize your log-processing application
  • 5. GET TO KNOW CONCURRENT Leader in Application Infrastructure for Big Data • Building enterprise software to simplify Big Data application development and management Products and Technology • CASCADING: Open source. The most widely used application infrastructure for building Big Data apps, with over 175,000 downloads each month • DRIVEN: Enterprise data application management for Big Data apps Proven: Simple, Reliable, Robust • Thousands of enterprises rely on Concurrent to provide their data application infrastructure. Founded: 2008 HQ: San Francisco, CA CEO: Gary Nakamura CTO, Founder: Chris Wensel www.concurrentinc.com
  • 6. CASCADING - DE-FACTO STANDARD FOR DATA APPS • Standard for enterprise data app development • Your programming language of choice • Cascading applications that run on MapReduce will also run on Apache Spark, Storm, and … [Diagram: Cascading apps (SQL, Clojure, Ruby) on supported fabrics and data stores (Mainframe, DB / DW, In-Memory, Data Stores, Hadoop) and new fabrics (Tez, Storm)]
  • 7. CASCADING 3.0 “Write once and deploy on your fabric of choice.” • The Innovation: Cascading 3.0 will allow data apps to execute on existing and emerging fabrics through its new customizable query planner. • Cascading 3.0 will support Local In-Memory and Apache MapReduce, and soon thereafter (3.1) Apache Tez, Apache Spark, and Apache Storm [Diagram: enterprise data applications on computation fabrics (Local In-Memory, MapReduce, Apache Tez, Storm)]
  • 8. … AND INCLUDES RICH SET OF EXTENSIONS http://www.cascading.org/extensions/
  • 9. DEMO: WORD COUNT EXAMPLE WITH CASCADING
      // configuration
      String docPath = args[ 0 ];
      String wcPath = args[ 1 ];
      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

      // integration: create source and sink taps (tab-delimited, with a header row)
      Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
      Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

      // processing: specify a regex to split "document" text lines into token stream
      Fields token = new Fields( "token" );
      Fields text = new Fields( "text" );
      RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
      // only returns "token"
      Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

      // determine the word counts
      Pipe wcPipe = new Pipe( "wc", docPipe );
      wcPipe = new GroupBy( wcPipe, token );
      wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

      // scheduling: connect the taps, pipes, etc., into a flow definition
      FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
        .addSource( docPipe, docTap )
        .addTailSink( wcPipe, wcTap );

      // create the Flow
      Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
      wcFlow.complete(); // <<-- Runs jobs on Cluster
  • 10. SOME COMMON PATTERNS • Functions • Filters • Joins ‣ Inner / Outer / Mixed ‣ Asymmetrical / Symmetrical • Merge (Union) • Grouping ‣ Secondary Sorting ‣ Unique (Distinct) • Aggregations ‣ Count, Average, etc [Diagram: a pipeline of filters and functions between data sources, and a topology with Split, Join, and Merge]
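Several of the patterns listed on this slide (merge, filter, function, grouping, aggregation) can be illustrated with a small plain-Java sketch. It deliberately uses `java.util.stream` rather than the Cascading API, so it shows the shape of the patterns only, not how Cascading implements them. The input format ("host status" per line) and the names `PatternSketch` and `errorsPerHost` are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Plain-Java analogy of the listed patterns; this is not Cascading code.
final class PatternSketch {

    // Counts responses with status >= 500 per host across two log sources.
    // Each input line is assumed to look like "<host> <3-digit status>".
    static Map<String, Long> errorsPerHost(List<String> sourceA, List<String> sourceB) {
        return Stream.concat(sourceA.stream(), sourceB.stream())  // merge (union)
            .map(String::trim)
            .filter(line -> line.matches("\\S+ \\d{3}"))           // filter: drop malformed lines
            .map(line -> line.split(" "))                          // function: line -> fields
            .filter(fields -> Integer.parseInt(fields[1]) >= 500)  // filter: server errors only
            .collect(Collectors.groupingBy(                        // grouping by host
                fields -> fields[0],
                TreeMap::new,
                Collectors.counting()));                           // aggregation: count
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("web1 200", "web2 500");
        List<String> b = Arrays.asList("web1 503", "", "web2 500");
        System.out.println(errorsPerHost(a, b));
    }
}
```

The sketch runs in a single JVM over in-memory lists; the point of a framework such as Cascading is to express the same kind of pipeline so that it can be planned and executed on a cluster.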
  • 11. THE STANDARD FOR DATA APPLICATION DEVELOPMENT www.cascading.org Proven application development framework for building data apps. Application platform that addresses: • Build data apps that are scale-free: Design principles ensure best practices at any scale • Test-Driven Development: Efficiently test code and process local files before deploying on a cluster • Staffing Bottleneck: Use existing Java, SQL, modeling skill sets • Application Portability: Write once, then run on different computation fabrics • Operational Complexity: Simple. Package up into one jar and hand to operations • Systems Integration: Hadoop never lives alone. Easily integrate to existing systems
  • 12. CASCADING • Java API • Separates business logic from integration • Testable at every lifecycle stage • Works with any JVM language • Many integration adapters [Diagram: Cascading layers (Processing API, Integration API, Process Planner, Scheduler API, Scheduler) between Enterprise Java and scripting languages (Scala, Clojure, JRuby, Jython, Groovy) above, and Apache Hadoop and data stores below]
  • 13. BUSINESSES DEPEND ON US • Cascading Java API • Data normalization and cleansing of search and click-through logs for use by analytics tools, Hive analysts • Easy to operationalize heavy lifting of data in one framework
  • 14. BROAD SUPPORT Hadoop ecosystem supports Cascading
  • 16. OPERATIONAL EXCELLENCE Visibility Through All Stages of App Lifecycle From Development (Building and Testing): • Design & Development • Debugging • Tuning To Production (Monitoring and Tracking): • Maintain Business SLAs • Balance & Controls • Application and Data Quality • Operational Health • Real-time Insights
  • 18. DEEPER VISUALIZATION INTO YOUR HADOOP CODE Debug and optimize your Hadoop applications more effectively with Driven • Easily comprehend, debug, and tune your data applications • Get rich insights on your application performance • Monitor applications in real-time • Compare app performance with historical (previous) iterations
  • 19. GET OPERATIONAL INSIGHTS WITH DRIVEN Visualize the activity of your applications to help maintain SLAs • Quickly break down how often applications execute based on their tags, teams, or names • Immediately identify if any application is monopolizing cluster resources • Understand the utilization of your cluster with a timeline of all applications running
  • 20. ORGANIZE YOUR APPLICATIONS WITH GREATER FIDELITY Segment your applications for greater insights across all your applications • Easily keep track of all your applications by segmenting them with user-defined tags • Segment your applications for trending analysis, cluster analysis, and developing chargeback models • Quickly break down how often applications execute based on their tags, teams, or names
  • 21. COLLABORATE WITH TEAMS Utilize teams to collaborate and gain visibility over your set of applications • Invite others to view and collaborate on a specific application • Gain visibility to all the apps and their owners associated with each team • Simply manage your teams and the users assigned to them
  • 22. MANAGE PORTFOLIO OF BIG DATA APPLICATIONS Fast, powerful, rich search capabilities enable you to easily find the exact set of applications that you’re looking for • Identify problematic apps with their owners and teams • Search for groups of applications segmented by user-defined tags • Compare specific applications with their previous iterations to ensure that your application can meet its SL
  • 23. DRIVEN FOR HIVE: OPERATIONAL VISIBILITY FOR YOUR HIVE APPS • Understand the anatomy of your Hive app • Track execution of queries as a single business process • Identify outlier behavior by comparison with historical runs • Analyze rich operational metadata • Correlate Hive app behavior with other events on the cluster
  • 24. TAKE AWAY POINTS • Logstash provides a flexible and robust way to collect log data; Grok lets you parse logs without coding, and the Kibana UI is an attractive way to analyze the information • Cascading is the de-facto framework for building Big Data (Hadoop) applications and processing data at scale • Cascading + Logstash lets you develop applications to collect and process large volumes of data • With Driven, you can put your mission-critical log-processing applications in production and monitor SLAs
  • 25. CONTACT INFORMATION Supreet Oberoi supreet@concurrentinc.com 650-868-7675 (m) @supreet_online
  • 26. DRIVING INNOVATION THROUGH DATA THANK YOU Supreet Oberoi