Accelerate Big Data Application Development with Cascading and HDP, webinar hosted by Hortonworks and Concurrent. Visit Hortonworks.com/webinars to access the recording.
Accelerate Big Data Application Development with Cascading and HDP, Hortonworks and Concurrent webinar 4-22-2014
1. Page 1
Accelerate Big Data
Application Development with
Cascading and HDP
April 22, 2014
2. Page 2
Agenda
• Take advantage of the latest Hadoop processing
frameworks like YARN and Tez in HDP 2.1
• How developers can create future-proof, data-driven
applications built on Apache Hadoop with Cascading
• How Cascading accelerates Hadoop application
development by abstracting the platforms underneath
3. Page 3
Speakers
Ajay Singh, Director of
Technical Channels,
Hortonworks
Supreet Oberoi, VP of
Field Engineering,
Concurrent
4. Page 4
Open
Leadership
Drive innovation in
the open exclusively
via the Apache
community-driven
open source process
Enterprise
Rigor
Engineer, test and
certify Apache Hadoop
with the enterprise in
mind
Ecosystem
Endorsement
Focus on deep
integration with
existing data center
technologies and
skills
Enable your Modern Data Architecture
by delivering Enterprise Apache Hadoop
Our
Mission:
Reseller Partners:
Headquartered in Palo Alto, CA; 300+ employees and growing
5. Page 5
A data architecture under pressure
from new data
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DATA SYSTEM: REPOSITORIES (RDBMS, EDW, MPP)
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs)
Source: IDC
2.8 ZB in 2012
85% from New Data Types
15x Machine Data by 2020
40 ZB by 2020
OLTP, ERP, CRM Systems
Unstructured documents, emails
Clickstream
Server logs
Sentiment, Web Data
Sensor, Machine Data
Geolocation
6. Page 6
A Modern Data Architecture
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DEV & DATA TOOLS: BUILD & TEST
OPERATIONAL TOOLS: MANAGE & MONITOR
DATA SYSTEM: REPOSITORIES (RDBMS, EDW, MPP)
ENTERPRISE HADOOP: Governance & Integration, Security, Operations, Data Access, Data Management
SOURCES: Existing Sources (CRM, ERP, Clickstream, Logs); Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
7. Page 7
Clickstream
Capture and
analyze website
visitors’ data trails
and optimize your
website
Sensors
Discover
patterns in data
streaming
automatically
from remote
sensors and
machines
Server Logs
Research logs to
diagnose process
failures and
prevent security
breaches
New types of data
Hadoop Value:
Sentiment
Understand how
your customers
feel about your
brand and
products –
right now
Geographic
Analyze
location-based
data to manage
operations
where they
occur
Unstructured
Understand patterns
in files across
millions of web
pages, emails, and
documents
9. Page 9
Core Capabilities of Enterprise Hadoop
DATA MANAGEMENT: Store and process all of your Corporate Data Assets
DATA ACCESS: Access your data simultaneously in multiple ways (batch, interactive, real-time)
GOVERNANCE & INTEGRATION: Load data and manage according to policy
SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
OPERATIONS: Deploy and effectively manage the platform
PRESENTATION & APPLICATION: Enable both existing and new applications to provide value to the organization
ENTERPRISE MGMT & SECURITY: Empower existing operations and security tools to manage Hadoop
DEPLOYMENT OPTIONS: Provide deployment choice across physical, virtual, cloud
11. Page 11
Hadoop is wholly integrated
into the data center
APPLICATIONS / DATA SYSTEM / SOURCES
RDBMS, EDW, MPP
HANA
BusinessObjects BI
OPERATIONAL TOOLS
DEV & DATA TOOLS
INFRASTRUCTURE
HDP 2.1: Governance & Integration, Security, Operations, Data Access, Data Management
Existing Sources (CRM, ERP, Clickstream, Logs)
Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
12. Page 12
Developing Apps on Hadoop
• Spring XD Framework
– Consistent configuration & Java API across wide range of Hadoop ecosystem
projects
• Microsoft .NET SDK For Hadoop
– API access to HDP on Windows and HDInsight service
– LINQ libraries for accessing Hive
• Cascading
– Delivers an easy to use abstraction layer for developing Hadoop applications
– Supports development in Scala & Clojure
– Hortonworks to Certify, Support & Deliver Cascading SDK with Hortonworks Data
Platform
14. HORTONWORKS PARTNERS WITH CONCURRENT
• The Cascading SDK will now be integrated with the
Hortonworks Data Platform (HDP)
• Hortonworks will certify and support Cascading™
SDK with HDP
• Cascading will support Apache Tez; companies using
Cascading or domain-specific languages on
Cascading can seamlessly migrate to HDP supporting
Apache Tez
The partnership benefits users by combining the power and simplicity of
Cascading with the reliability and stability of HDP.
15. Confidential
AGENDA
• Who is Concurrent
• What is Cascading
• Where is it used
• What problems does Cascading solve
• What is included in the Cascading kit
17. GET TO KNOW CONCURRENT
Leader in Application Infrastructure for Big Data
• Building enterprise software to simplify Big Data application
development and management
Products and Technology
• CASCADING
The most widely used application infrastructure for building Big
Data applications with over 150,000 downloads each month
• DRIVEN
Enterprise Data Application management for Big Data apps
Proven - Simple, Reliable, Robust
• Thousands of enterprises rely on Concurrent to provide their
data application infrastructure.
Founded: 2008
HQ: San Francisco, CA
CEO: Gary Nakamura
CTO, Founder: Chris Wensel
www.concurrentinc.com
18. PRODUCTS AND TECHNOLOGY
Big Data Application Development
Simple, Reliable, Repeatable
Unmatched Application Insight
Visibility into your Data Applications
Open Source | Commercial
www.concurrentinc.com/products
Open Source Community
Focused on Data App Development
Project home of Cascading
Collection of sub-projects / tools
Data App Management
Realtime monitoring
Performance Management
Operational Control
Data Provenance
Compliance Governance
19. BUSINESSES DEPEND ON US
• Cascading Java API
• Data normalization and cleansing of search and click-through
logs for use by analytics tools, Hive analysts
• Easy to operationalize heavy lifting of data
20. BUSINESSES DEPEND ON US
• Cascalog (Clojure)
• Weather pattern modeling to protect growers against loss
• ETL against 20+ datasets daily
• Machine learning to create models
• Purchased by Monsanto for $930M US
21. BUSINESSES DEPEND ON US
• Scalding (Scala)
• Machine learning (linear algebra) to improve
• User experience
• Ad quality (matching users and ad effectiveness)
• All revenue applications are running on Cascading/Scalding
• IPO
TWITTER
22. BUSINESSES DEPEND ON US
• Estimate suicide risk from what people write online
• Cascading + Cassandra
• You can do more than optimize ad yields
• http://www.durkheimproject.org
24. DRIVING ADVANTAGE WITH DATA APPLICATIONS
Enterprise IT
Extract Transform Load
Log File Analysis
Systems Integration
Operations Analysis
Corporate Apps
HR Analytics
Employee Behavioral Analysis
Customer Support | eCRM
Business Reporting
Telecom
Data processing of Open Data
Geospatial Indexing
Consumer Mobile Apps
Location based services
Marketing / Retail
Mobile, Social, Search Analytics
Funnel analysis
Revenue attribution
Customer experiments
Ad Optimization
Retail recommenders
Consumer / Entertainment
Music Recommendation
Comparison Shopping
Restaurant Rankings
Real Estate
Rental Listings
Travel Search & Forecast
Finance
Fraud and Anomaly Detection
Fraud Experiments
Customer Analytics
Insurance Risk Metric
Health / Biotech
Aggregate metrics for Govt
Person biometrics
Veterinary diagnostics
Next-Gen Genomics
Agronomics
Environmental Maps
25. BIG DATA: THE NEXT PHASE OF MATURITY
“It’s all about the Apps”
There needs to be a comprehensive solution for building, deploying, running and
managing this new class of enterprise applications
Business Strategy
Loyalty and promotions analysis
Retention campaigns
Marketing campaign optimization
Fraud detection
Risk management
Scientific research
Remote monitoring and diagnosis
and more
Data & Technology
Your Data & Systems
Hadoop, EDW, Mainframe,
System Logs, NoSQL DBs, etc.
Challenges
Leveraging existing skill sets,
existing systems, past investments
and existing business processes
Connecting Business and Data
27. CASCADING
• Java API (alternative to Hadoop MapReduce)
• Separates business logic from integration
• Testable at every lifecycle stage
• Works with any JVM language
• Many integration adapters
Diagram labels: Enterprise Java; Scripting (Scala, Clojure, JRuby, Jython, Groovy); Cascading (Process Planner, Processing API, Integration API, Scheduler API, Scheduler); Apache Hadoop; Data Stores
30. SOME COMMON PATTERNS
• Functions
• Filters
• Joins
‣ Inner / Outer / Mixed
‣ Asymmetrical / Symmetrical
• Merge (Union)
• Grouping
‣ Secondary Sorting
‣ Unique (Distinct)
• Aggregations
‣ Count, Average, etc
‣ Rolling windows
Diagrams: a Pipeline (data passing through a chain of filters and functions) and a Topology (data passing through Split, Join and Merge)
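The patterns listed on this slide can be illustrated without a cluster. The following is a minimal sketch in plain Java streams, not Cascading code: the class name `PatternSketch`, the `clicksPerCountry` method, and the click and user data are all hypothetical and only stand in for a filter, an inner join, a grouping and a count aggregation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

public class PatternSketch {
    // clicks are hypothetical {userId, url} pairs; countryByUser is the lookup side of the join.
    static Map<String, Long> clicksPerCountry(List<String[]> clicks, Map<String, String> countryByUser) {
        return clicks.stream()
            // Filter: drop records with an empty URL.
            .filter(click -> !click[1].isEmpty())
            // Inner join: keep only clicks whose user id exists on the lookup side.
            .filter(click -> countryByUser.containsKey(click[0]))
            // Function: replace each click with the joined country value.
            .map(click -> countryByUser.get(click[0]))
            // Grouping + aggregation: group by country and count, sorted by key.
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, String> countryByUser = new HashMap<>();
        countryByUser.put("u1", "US");
        countryByUser.put("u2", "DE");

        List<String[]> clicks = Arrays.asList(
            new String[] {"u1", "/home"},
            new String[] {"u1", ""},          // removed by the filter
            new String[] {"u2", "/pricing"},
            new String[] {"u3", "/docs"},     // removed by the inner join
            new String[] {"u2", "/blog"});

        System.out.println(clicksPerCountry(clicks, countryByUser)); // prints {DE=2, US=1}
    }
}
```

In Cascading the same steps would be expressed as pipe assemblies and planned onto the cluster; the sketch only shows the shape of the data flow.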
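31. WORD COUNT EXAMPLE
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class Main {
  public static void main( String[] args ) {
    // configuration: input/output paths and the application jar
    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // integration: create source and sink taps (tab-delimited text with a header row)
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // processing: specify a regex to split "document" text lines into token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // scheduling: connect the taps, pipes, etc., into a flow definition
    FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );
    // create the Flow
    Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
    wcFlow.complete(); // <<-- Runs jobs on Cluster
  }
}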
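To make the processing step easier to follow, here is a rough local analogue of what the flow computes, written with plain Java streams instead of Cascading. It is a sketch under stated assumptions: `LocalWordCount` and its `count` method are hypothetical names, the sample lines are invented, and the empty-token filter is an added cleanup step that the slide's flow does not include.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

public class LocalWordCount {
    // Same delimiter character class as the splitter regex in the slide's example.
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
            // Analogue of Each + RegexSplitGenerator: one token per split piece.
            .flatMap(line -> Arrays.stream(line.split(DELIMITERS)))
            // Assumption: drop empty tokens produced by adjacent delimiters.
            .filter(token -> !token.isEmpty())
            // Analogue of GroupBy + Every(Count): group identical tokens and count them.
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = count(Arrays.asList("rain shadow, rain", "dry (shadow) air."));
        // Print token and count separated by a tab, like the sink tap's delimiter.
        counts.forEach((token, n) -> System.out.println(token + "\t" + n));
    }
}
```

The local version processes an in-memory list, while the Cascading flow reads from and writes to taps and is planned into jobs that run on the cluster.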
32. CASCADING OVERVIEW
www.cascading.org
Build Data Apps that are scale-free
Design principles ensure best practices at any scale
Test-Driven Development
Efficiently test code and process local files before you deploy on a cluster
Staffing Bottleneck
Use existing Java, SQL, modeling skill sets
Operational Complexity
Simple - Package up into one jar and hand to operations
Application Portability
Write once, then run on different computation fabrics.
Systems Integration
Hadoop never lives alone. Easily integrate with your existing systems
Proven application development
framework for building Data
applications
Framework addresses
34. PRODUCTS AND TECHNOLOGY
LINGUAL Simplifying Systems Integration
PATTERN Enabling Machine Scoring Algorithms
Big Data Application Development
Simple, Reliable, Repeatable
Unmatched Application Insight
Visibility into your Data Applications
Open Source | Commercial
www.concurrentinc.com/products
36. LINGUAL
• Lingual is an extension to Cascading that
executes ANSI SQL queries as Cascading
apps
• Supports integrating with any data source
that can be accessed through JDBC:
a Cascading Tap can be created for any
source supporting JDBC
• Great for migration of data, integrating
with non-Big Data assets; extends life
of existing IT assets in an organization
Diagram labels: CLI / Shell; Enterprise Java; Catalog; Lingual (JDBC API, Lingual API, Provider API, Query Planner); Cascading; Apache Hadoop; Data Stores
37. SCALDING
• Scalding is a language binding to Cascading for Scala
- The name Scalding comes from combining SCALa and
cascaDING
• Scalding is great for Scala developers; can crisply write
constructs for matrix math…
• Scalding has very large commercial deployments at:
- Twitter - Use cases such as the revenue quality team, ad
targeting and traffic quality
- eBay - Use cases include search analytics and other production
data pipelines
38. DRIVEN OVERVIEW
What is Driven?
The first application performance management product for Big Data applications
Capabilities
Visualize your Data App
No more black box! Instantly visualize your running app in real-time
Diagnose App Failures
Identify where and how your app failed… all without sorting through logs
Track App Performance
For all your apps, view and compare history of your app’s runtime performance
Insight into your Applications
At any moment, quickly understand what your app is doing on your cluster
Benefits
• Accelerate Time to Market
• Build Reliable Applications
• Optimize Application Performance
Key Features
• Application visualization
• Dashboard performance view
• Application performance history
• Insights for each application (workflow,
telemetry, error types)
• Team collaboration and management
Works with: LINGUAL, PATTERN, SCALDING, CASCALOG
www.cascading.io
40. CASCADING AVAILABILITY
Cascading | Availability: Cascading 2.5, Available Now | License: Apache License 2.0 | Support: Community Forums & Mailing List, Enterprise Support
Lingual | Availability: Lingual 1.1, Available Now | License: Apache License 2.0 | Support: Community Forums & Mailing List, Enterprise Support
Pattern | Availability: Pattern 1.0-WIP, WIP Available Now | License: Apache License 2.0 | Support: Community Forums & Mailing List, Enterprise Support
Cascading, Lingual and Pattern are open source projects freely available to the general public under Apache License 2.0
41. Summary
• APM for Big Data | The first application performance management product for Big Data applications
• For Developers and Operators | Significantly improves developer productivity and operations control by providing an
unprecedented level of insight into building and managing enterprise-grade data applications
• Collaboration | Facilitates and encourages user collaboration to build enterprise data applications
• Community Integration | Driven is a free cloud service integrated with the Cascading open source community
• Licensing | Driven is free for development (cloud only) and licensable for production or on-premise deployments
• Deployment Options | Deploy in the cloud or on-premise
Accelerate Time to Market
Process visualization and monitoring
capabilities in a rich UI
Build Reliable Apps
Detailed insight into data processing
logic and algorithms
Optimize App Performance
Key application behavior metrics with
historical data to trend performance
45. Page 13
SAN JOSE
June 3-5
AMSTERDAM
April 2-3
• 6 tracks, 3 days, and 120+ sessions to choose from
• Community Focused - Sessions voted on by the public and
selected by a committee of industry luminaries
• Deep Dive Technical Content - Including a Committer track with
content presented by Apache committers
• Business and Technical Topics
• Community Activities - Hadoop Summit will host community
meetups and birds of a feather sessions
www.hadoopsummit.org
The Largest Hadoop Community Events in
Europe and North America
46. Page 14
Questions?
Use the Q/A panel to ask your questions
Download the Hortonworks Sandbox and Cascading
• Cascading and HDP 2.1 Sandbox
• Hortonworks Sandbox
• Cascading Impatient Tutorial