© Rocana, Inc. All Rights Reserved. | 1
Joey Echeverria, Platform Technical Lead
Strata+Hadoop World, March 31st 2016
San Jose, CA
Embeddable data transformation for
real-time streams
Slides
http://j.mp/rocana-transform-slides
Questions
http://j.mp/hw-questions
Context
Joey
• Where I work: Rocana – Platform Technical Lead
• Where I used to work: Cloudera (’11-’15), NSA
• Distributed systems, security, data processing, big data
Signing today at 1pm at the
Cloudera booth
History
“Legacy” data architecture
[Diagram: Data Producers, Flume/Sqoop, HDFS (Avro/Parquet files), MapReduce, Spark, Impala, Visualization/Query]
Stream data architecture
[Diagram: Data Producers, Kafka (Avro serialized records), Kafka Consumers, Storm, Spark Streaming, Flink, Real-time Visualization, HDFS (Avro/Parquet files)]
Stream processing
A primer
• Filter
• Extract
• Project
• Aggregate
• Join
• Model
• Data transformation
Apache Storm
• "Distributed real-time computation system"
• Applications packaged into topologies (think MapReduce job)
• Topologies operate over streams of tuples
• Spout: source of a stream
• Bolt: arbitrary operation such as filtering, aggregating, joining, or executing arbitrary functions
Apache Spark
• Supports batch and stream processing
• Continuous stream of records discretized into a DStream
• DStream: a sequence of RDDs (batches of records)
• Micro-batch
Apache Flink
• Supports batch and stream processing
• DataStream: unbounded collection of records
• Operations can apply to individual records or windows of records
• Supports record-at-a-time processing (like Storm)
Apache Kafka
• Pub-sub messaging system implemented as a distributed commit log
• Popular as a source and sink for data streams
• Scalability, durability, and easy-to-understand delivery guarantees
• Can do stream processing directly in Kafka consumers
Data transformation
Filter
filter
Extract
127.0.0.1 Mozilla/5.0 laura [31/Mar/2016] "GET /index.html HTTP/1.0" 200 2326
ts: 1436576671000
body: <binary blob>
event_type_id: 100
...
extract
ts: 1436576671000
body: <binary blob>
event_type_id: 100
attributes: {
ip: "127.0.0.1"
user_agent: "Mozilla/5.0"
user_id: "laura"
date: "[31/Mar/2016]"
request: "GET /index.html HTTP/1.0"
status_code: "200"
size: "2326"
}
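The extract step above can be sketched in plain Java. This is a minimal illustration rather than Rocana Transform's implementation: the class name, the regular expression, and the attribute names are assumptions based on the sample log line on this slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AccessLogExtractor {
    // Assumed layout of the sample line: ip, user agent, user id, [date], "request", status, size.
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) (\\[[^\\]]+\\]) \"([^\"]*)\" (\\d{3}) (\\d+)$");

    /** Returns the extracted attributes as strings, or an empty map if the line does not match. */
    static Map<String, String> extract(String line) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return attributes;
        }
        attributes.put("ip", m.group(1));
        attributes.put("user_agent", m.group(2));
        attributes.put("user_id", m.group(3));
        attributes.put("date", m.group(4));
        attributes.put("request", m.group(5));
        attributes.put("status_code", m.group(6));
        attributes.put("size", m.group(7));
        return attributes;
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 Mozilla/5.0 laura [31/Mar/2016] "
            + "\"GET /index.html HTTP/1.0\" 200 2326";
        System.out.println(extract(line));
    }
}
```

Every extracted value stays a string at this stage, which matches the quoted `status_code` and `size` values in the attributes shown on the slide.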
Project
ts: 1436576671000
body: <binary blob>
event_type_id: 100
attributes: {
ip: "127.0.0.1"
user_agent: "Mozilla/5.0"
user_id: "laura"
date: "[31/Mar/2016]"
request: "GET /index.html HTTP/1.0"
status_code: "200"
size: "2326"
}
ts: 1459444413000
ip: "127.0.0.1"
user_agent: "Mozilla/5.0"
user_id: "laura"
request: "GET /index.html HTTP/1.0"
status_code: 200
size: 2326
project
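A projection like the one on this slide could look roughly as follows in Java. The class and method names are hypothetical. The slide's output also carries a different `ts` from its input, presumably derived from the `date` attribute; this sketch leaves that step out and takes the timestamp as a parameter.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class AccessLogProjector {
    /**
     * Keeps only the selected fields, moves them to the top level, and converts
     * status_code and size from strings to integers.
     * Throws NumberFormatException if a numeric attribute is missing or malformed.
     */
    static Map<String, Object> project(long ts, Map<String, String> attributes) {
        Map<String, Object> out = new LinkedHashMap<>();
        out.put("ts", ts);
        for (String key : new String[] {"ip", "user_agent", "user_id", "request"}) {
            out.put(key, attributes.get(key));
        }
        out.put("status_code", Integer.parseInt(attributes.get("status_code")));
        out.put("size", Integer.parseInt(attributes.get("size")));
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> attributes = new HashMap<>();
        attributes.put("ip", "127.0.0.1");
        attributes.put("user_agent", "Mozilla/5.0");
        attributes.put("user_id", "laura");
        attributes.put("date", "[31/Mar/2016]");
        attributes.put("request", "GET /index.html HTTP/1.0");
        attributes.put("status_code", "200");
        attributes.put("size", "2326");
        System.out.println(project(1459444413000L, attributes));
    }
}
```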
Problem
Who
• Developers
• Data engineers
• Sysadmins
• Analysts
Tools
The dark art of data science
• Feature engineering
• “Getting a mess of raw data that can be used as input to a machine learning algorithm” - @josh_wills
• Video from Midwest.io 2014
Data transformation for all
Rocana Transform
• Library
• Java
• Rocana configuration
• JSON + comments + specific numeric types - excess quoting
Data model
• Event schema
  • id: A globally unique identifier for this event
  • ts: Epoch timestamp in milliseconds
  • event_type_id: ID indicating the type of the event
  • location: Location from which the event was generated
  • host: Hostname, IP, or other device identifier from which the event was generated
  • service: Service or process from which the event was generated
  • body: Raw event content in bytes
  • attributes: Event type-specific key/value pairs
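For readers who think in Java types, the schema above could be mirrored by a small value class. This is only a sketch: the class name and the Java field types (for example `long` for `ts` and `Map<String, String>` for `attributes`) are assumptions inferred from the field descriptions, not the library's actual event class.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

final class EventSketch {
    final String id;           // globally unique identifier
    final long ts;             // epoch timestamp in milliseconds
    final int eventTypeId;     // type of the event
    final String location;     // where the event was generated
    final String host;         // hostname, IP, or other device identifier
    final String service;      // service or process that generated the event
    private final byte[] body; // raw event content
    final Map<String, String> attributes; // event type-specific key/value pairs

    EventSketch(String id, long ts, int eventTypeId, String location, String host,
                String service, byte[] body, Map<String, String> attributes) {
        this.id = id;
        this.ts = ts;
        this.eventTypeId = eventTypeId;
        this.location = location;
        this.host = host;
        this.service = service;
        this.body = body.clone(); // defensive copy so callers cannot mutate the event
        this.attributes = Collections.unmodifiableMap(new LinkedHashMap<>(attributes));
    }

    /** Returns a copy of the raw body bytes. */
    byte[] body() {
        return body.clone();
    }

    public static void main(String[] args) {
        EventSketch event = new EventSketch("example-id", 1436576671000L, 100,
            "aws/us-west-2a", "example01.rocana.com", "dhclient",
            "raw syslog line".getBytes(StandardCharsets.UTF_8),
            Collections.singletonMap("syslog_pid", "865"));
        System.out.println(event.host + " " + event.attributes);
    }
}
```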
Example event
{
"id": "JRHAIDMLCKLEAPMIQDHFLO3MXYXV7NVBEJNDKZGS2XVSEINGGBHA====",
"event_type_id": 100,
"ts": 1436576671000,
"location": "aws/us-west-2a",
"host": "example01.rocana.com",
"service": "dhclient",
"body": "<36>Jul 10 18:04:31 gs09.example.com dhclient[865] DHCPACK from …",
"attributes": {
"syslog_timestamp": "1436576671000",
"syslog_process": "dhclient",
"syslog_pid": "865",
"syslog_facility": "3",
"syslog_severity": "6",
"syslog_hostname": "example01",
"syslog_message": "DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)"
}
}
Filter, extract, and flatten
• Filter out events without type id 100
• Filter out events without hostname prefix "ex"
• Extract a numeric prefix from the syslog message
• Flatten syslog attributes to top-level fields in a different Avro schema
Filter, extract, and flatten
{
load-event: {},
// Filter by event_type_id
filter: { expression: "${event_type_id == 100}" },
// Extract hostname prefix
regex: { ... },
filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },
// Extract a numeric prefix from the syslog message
regex: { ... },
// Build flattened record
build-avro-record: { ... },
// Accumulate output record
accumulate-output: {
value: "${output_record}"
}
}
Extract hostname prefix
{
load-event: {},
filter: { expression: "${event_type_id == 100}" },
regex: {
pattern: "^(.{2}).*$",
value: "${attr.syslog_hostname}",
destination: "host_prefix"
},
filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },
...
}
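The two filters and the hostname regex in this configuration can be mirrored in plain Java to make the logic concrete. The class and method names are hypothetical, and the sketch assumes that an event is kept when a filter expression evaluates to true, which is how the slide describing the filters reads.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HostPrefixFilter {
    // Same pattern as the regex action: capture the first two characters of the hostname.
    private static final Pattern PREFIX = Pattern.compile("^(.{2}).*$");

    /** Returns true if the event should be kept: type id 100 and hostname prefix "ex". */
    static boolean accept(int eventTypeId, String syslogHostname) {
        if (eventTypeId != 100) {
            return false; // first filter: event_type_id == 100
        }
        Matcher m = PREFIX.matcher(syslogHostname);
        // second filter: capture group 1 of the regex match must equal "ex"
        return m.matches() && "ex".equals(m.group(1));
    }

    public static void main(String[] args) {
        System.out.println(accept(100, "example01"));
        System.out.println(accept(100, "web01"));
    }
}
```

A hostname shorter than two characters does not match the pattern at all, so the sketch rejects it.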
Extract numeric prefix
...
filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },
regex: {
pattern: "^([0-9]*)",
value: "${attributes['syslog_message']}",
destination: "msg",
match-actions: {
set-values: { extracted_field: "${msg.match.group.1}" }
},
no-match-actions: {
set-values: { extracted_field: "" }
}
},
...
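As a rough Java equivalent of the numeric-prefix extraction, the sketch below applies the same pattern with `Matcher.find()`. Whether Rocana Transform uses find or full-match semantics is not stated on the slides, so that choice is an assumption, as are the class and method names. Under find semantics `[0-9]*` also matches an empty prefix, so a message without leading digits yields an empty string, the same value the no-match branch sets.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumericPrefix {
    // Same pattern as the regex action: zero or more digits at the start of the message.
    private static final Pattern DIGITS = Pattern.compile("^([0-9]*)");

    /** Returns the leading digits of the message, or "" when there are none. */
    static String extract(String message) {
        Matcher m = DIGITS.matcher(message);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        System.out.println(extract("404 page not found"));
        System.out.println(extract("DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)").isEmpty());
    }
}
```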
Build flattened record
...
build-avro-record: {
schema-uri: "resource:avro-schemas/flattened-syslog.avsc",
destination: "output_record",
field-mapping: {
ts: "${ts}",
event_type_id: "${event_type_id}",
source: "${source}",
syslog_facility: "${convert:toInt(attributes['syslog_facility'])}",
syslog_severity: "${convert:toInt(attributes['syslog_severity'])}",
...
syslog_message: "${attributes['syslog_message']}",
syslog_pid: "${convert:toInt(attributes['syslog_pid'])}",
extracted_field: "${extracted_field}"
},
},
...
Extract metrics from log data
Extract metrics
• Input: HTTP status logs
• Extract request latency
• Extract counts by HTTP status code
• Metric types
  • Gauge: A value that varies over time (think latency, CPU %, etc.)
  • Counter: A value that accumulates over time (think event volume, status codes, etc.)
Example metric event
{
"id": "JRHAIDMLCKLEAPMIQDHFLO3MXBBQ7NVBEJNDKZGS2XVSEINGGBHA====",
"event_type_id": 107,
"ts": 1436576671000,
"location": "aws/us-west-2a",
"host": "web01.rocana.com",
"service": "httpd",
"attributes": {
"m.http.request.latency": "4.2000000000E1|g",
"m.http.status.401.count": "1.0000000000E0|c"
}
}
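The metric values in this event look like a number in scientific notation followed by `|` and a one-letter type. The sketch below encodes and decodes that shape in Java. The format is inferred from the two example values only, and reading `g` as gauge and `c` as counter is an assumption; the class and method names are hypothetical.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

final class MetricValueCodec {
    /** Formats a value with ten fraction digits and an exponent, then appends the type suffix. */
    static String encode(double value, char type) {
        // Locale.ROOT keeps the decimal separator a '.' regardless of the default locale.
        DecimalFormat format = new DecimalFormat(
            "0.0000000000E0", DecimalFormatSymbols.getInstance(Locale.ROOT));
        return format.format(value) + "|" + type;
    }

    /** Parses the numeric part before the '|' separator. */
    static double decodeValue(String encoded) {
        return Double.parseDouble(encoded.substring(0, encoded.lastIndexOf('|')));
    }

    /** Returns the one-letter type after the '|' separator. */
    static char decodeType(String encoded) {
        return encoded.charAt(encoded.lastIndexOf('|') + 1);
    }

    public static void main(String[] args) {
        System.out.println(encode(42.0, 'g'));
        System.out.println(decodeValue("1.0000000000E0|c"));
    }
}
```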
Extract metrics
{
load-event: {},
build-metric: {
gauge-mapping: {
http.request.latency: "${convert:toDouble(attributes['latency'])}"
},
destination: "latency_metric"
},
accumulate-output: { value: "${latency_metric}" },
build-metric: {
dynamic-counter-mapping: [
"${string:format('http.status.%s.count', attributes['sc_status'])}", 1D
],
destination: "status_metric"
},
accumulate-output: { value: "${status_metric}" }
}
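To illustrate the dynamic counter mapping, the following Java sketch builds the counter name with the same format string and adds 1 per event. The slides do not show where or how counter values are summed, so the accumulation step here is an assumption for illustration, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class StatusCounterSketch {
    /** Builds the counter name with the same format string as the configuration. */
    static String counterName(String status) {
        return String.format("http.status.%s.count", status);
    }

    /** Adds 1.0 to the matching counter for each status code seen. */
    static Map<String, Double> count(List<String> statuses) {
        Map<String, Double> counters = new LinkedHashMap<>();
        for (String status : statuses) {
            counters.merge(counterName(status), 1.0, Double::sum);
        }
        return counters;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("200", "401", "200")));
    }
}
```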
Architecture
[Diagram: Configuration file, Java action objects, Context (variables), Driver]
1. Parse config
2. Initialize context
3. Execute actions
4. Read/write variables
5. Copy output
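The numbered flow can be sketched as a small driver loop in Java. Everything here is hypothetical: the slides name a driver, a context with variables, and action objects, but not their signatures. Step 1 (parsing the configuration into actions) is assumed to have happened already, so the sketch starts from a list of actions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DriverSketch {
    /** Hypothetical shared state: variables that actions read and write, plus accumulated output. */
    static final class Context {
        final Map<String, Object> variables = new HashMap<>();
        final List<Object> output = new ArrayList<>();
    }

    /** Hypothetical action contract; the real execute() signature is not shown on the slides. */
    interface Action {
        void execute(Context context);
    }

    /** Steps 2 to 5: initialize the context, run each action in order, copy the output. */
    static List<Object> run(List<Action> actions, Map<String, Object> event) {
        Context context = new Context();
        context.variables.putAll(event);        // 2. initialize context
        for (Action action : actions) {
            action.execute(context);            // 3. execute actions, 4. read/write variables
        }
        return new ArrayList<>(context.output); // 5. copy output
    }

    public static void main(String[] args) {
        List<Action> actions = new ArrayList<>();
        actions.add(c -> c.variables.put("upper", c.variables.get("host").toString().toUpperCase()));
        actions.add(c -> c.output.add(c.variables.get("upper")));
        Map<String, Object> event = new HashMap<>();
        event.put("host", "web01");
        System.out.println(run(actions, event));
    }
}
```

Because the driver only depends on the action list and the context, the same loop could run inside any stream processing framework or batch job, which is the embedding idea the talk describes.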
Custom actions
• Actions loaded at runtime using Java services framework
• Add your jar to the classpath
• Custom actions appear as top-level keywords just like regular actions
• Implement the execute() method of the Action interface
• Implement the build() method of the ActionBuilder interface
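A custom action along these lines might be structured as below. The `Action` and `ActionBuilder` interfaces here are stand-ins written for this sketch: the slides name the interfaces and their `execute()` and `build()` methods but not their signatures, and the `keyword()` method is an invented way to expose the top-level keyword. Discovery uses the standard `java.util.ServiceLoader`, which is what "Java services framework" is taken to mean.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;

final class CustomActionSketch {
    /** Stand-in for the library's Action interface (signature assumed). */
    interface Action {
        void execute(Map<String, Object> variables);
    }

    /** Stand-in for the library's ActionBuilder interface; keyword() is hypothetical. */
    interface ActionBuilder {
        String keyword();
        Action build(Map<String, Object> config);
    }

    /** Example custom action: look up a device name for a device id in a reference table. */
    static final class DeviceNameLookupBuilder implements ActionBuilder {
        @Override
        public String keyword() {
            return "device-name-lookup";
        }

        @Override
        public Action build(Map<String, Object> config) {
            Map<String, Object> table = new HashMap<>(config); // copy of the reference data
            return variables -> variables.put("device_name",
                table.getOrDefault(variables.get("device_id"), "unknown"));
        }
    }

    /**
     * Finds builders on the classpath. A real provider would be a public class with a
     * public no-argument constructor, listed in a META-INF/services file in its jar.
     */
    static Map<String, ActionBuilder> discover() {
        Map<String, ActionBuilder> byKeyword = new HashMap<>();
        for (ActionBuilder builder : ServiceLoader.load(ActionBuilder.class)) {
            byKeyword.put(builder.keyword(), builder);
        }
        return byKeyword;
    }

    public static void main(String[] args) {
        Map<String, Object> table = new HashMap<>();
        table.put("42", "core-switch-1");
        Map<String, Object> variables = new HashMap<>();
        variables.put("device_id", "42");
        new DeviceNameLookupBuilder().build(table).execute(variables);
        System.out.println(variables.get("device_name"));
    }
}
```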
Custom actions
• Parse custom log formats
  • Cisco ACS
  • Citrix
  • Juniper
  • Customer-specific formats
• Lookup IP addresses in the MaxMind GeoIP2 database
• Reference dataset lookups
  • Device id to device name
Putting it all together
• Stream processing is causing us to re-think how we analyze data
• Limiting the accessibility of data transformation increases costs and decreases velocity
• Reduce your reliance on developers to code custom pipelines
• Re-use transformation configuration in any stream processing framework or batch job
Coming soon
• Rocana transform will be released under the ASL 2.0
• The base configuration library is available today:
• https://github.com/scalingdata/rocana-configuration
Questions?
• Signing "Hadoop Security" today at 1pm at the Cloudera booth

Contenu connexe

Tendances

Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem DataWorks Summit/Hadoop Summit
 
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksThe Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksData Con LA
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveDataWorks Summit/Hadoop Summit
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseDataWorks Summit
 
Introduction to Apache NiFi And Storm
Introduction to Apache NiFi And StormIntroduction to Apache NiFi And Storm
Introduction to Apache NiFi And StormJungtaek Lim
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoopgregchanan
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
 
Standalone metastore-dws-sjc-june-2018
Standalone metastore-dws-sjc-june-2018Standalone metastore-dws-sjc-june-2018
Standalone metastore-dws-sjc-june-2018alanfgates
 
Building and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache OozieBuilding and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache OozieDataWorks Summit/Hadoop Summit
 
Design Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDesign Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDataWorks Summit
 
Troubleshooting Kerberos in Hadoop: Taming the Beast
Troubleshooting Kerberos in Hadoop: Taming the BeastTroubleshooting Kerberos in Hadoop: Taming the Beast
Troubleshooting Kerberos in Hadoop: Taming the BeastDataWorks Summit
 
Storage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesStorage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesDataWorks Summit
 

Tendances (20)

Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder HortonworksThe Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
The Future of Hadoop by Arun Murthy, PMC Apache Hadoop & Cofounder Hortonworks
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
 
Cost-based Query Optimization
Cost-based Query Optimization Cost-based Query Optimization
Cost-based Query Optimization
 
Active Learning for Fraud Prevention
Active Learning for Fraud PreventionActive Learning for Fraud Prevention
Active Learning for Fraud Prevention
 
Introduction to Apache NiFi And Storm
Introduction to Apache NiFi And StormIntroduction to Apache NiFi And Storm
Introduction to Apache NiFi And Storm
 
LEGO: Data Driven Growth Hacking Powered by Big Data
LEGO: Data Driven Growth Hacking Powered by Big Data LEGO: Data Driven Growth Hacking Powered by Big Data
LEGO: Data Driven Growth Hacking Powered by Big Data
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
 
Standalone metastore-dws-sjc-june-2018
Standalone metastore-dws-sjc-june-2018Standalone metastore-dws-sjc-june-2018
Standalone metastore-dws-sjc-june-2018
 
Building and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache OozieBuilding and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache Oozie
 
Design Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDesign Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data Analytics
 
Apache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, ScaleApache Hive 2.0: SQL, Speed, Scale
Apache Hive 2.0: SQL, Speed, Scale
 
Troubleshooting Kerberos in Hadoop: Taming the Beast
Troubleshooting Kerberos in Hadoop: Taming the BeastTroubleshooting Kerberos in Hadoop: Taming the Beast
Troubleshooting Kerberos in Hadoop: Taming the Beast
 
Storage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesStorage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on Kubernetes
 

En vedette

Hybrid & Logical Data Warehouse
Hybrid & Logical Data WarehouseHybrid & Logical Data Warehouse
Hybrid & Logical Data WarehouseHeungsoon Yang
 
Data Virtualization Reference Architectures: Correctly Architecting your Solu...
Data Virtualization Reference Architectures: Correctly Architecting your Solu...Data Virtualization Reference Architectures: Correctly Architecting your Solu...
Data Virtualization Reference Architectures: Correctly Architecting your Solu...Denodo
 
Introduction to sentry
Introduction to sentryIntroduction to sentry
Introduction to sentrymozillazg
 
Supporting Data Services Marketplace using Data Virtualization
Supporting Data Services Marketplace using Data VirtualizationSupporting Data Services Marketplace using Data Virtualization
Supporting Data Services Marketplace using Data VirtualizationDenodo
 
Apache Sentry for Hadoop security
Apache Sentry for Hadoop securityApache Sentry for Hadoop security
Apache Sentry for Hadoop securitybigdatagurus_meetup
 
Designing an Agile Fast Data Architecture for Big Data Ecosystem using Logica...
Designing an Agile Fast Data Architecture for Big Data Ecosystem using Logica...Designing an Agile Fast Data Architecture for Big Data Ecosystem using Logica...
Designing an Agile Fast Data Architecture for Big Data Ecosystem using Logica...Denodo
 
빅데이터 플랫폼 새로운 미래
빅데이터 플랫폼 새로운 미래빅데이터 플랫폼 새로운 미래
빅데이터 플랫폼 새로운 미래Wooseung Kim
 
Logical Data Warehouse and Data Lakes
Logical Data Warehouse and Data Lakes Logical Data Warehouse and Data Lakes
Logical Data Warehouse and Data Lakes Denodo
 
Big Data Industry Insights 2015
Big Data Industry Insights 2015 Big Data Industry Insights 2015
Big Data Industry Insights 2015 Den Reymer
 
Real-time Analytics in Financial: Use Case, Architecture and Challenges
Real-time Analytics in Financial: Use Case, Architecture and ChallengesReal-time Analytics in Financial: Use Case, Architecture and Challenges
Real-time Analytics in Financial: Use Case, Architecture and ChallengesDataWorks Summit/Hadoop Summit
 
Data Warehouse Modeling
Data Warehouse ModelingData Warehouse Modeling
Data Warehouse Modelingvivekjv
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 

En vedette (14)

Hybrid & Logical Data Warehouse
Hybrid & Logical Data WarehouseHybrid & Logical Data Warehouse
Hybrid & Logical Data Warehouse
 
Data Virtualization Reference Architectures: Correctly Architecting your Solu...
Data Virtualization Reference Architectures: Correctly Architecting your Solu...Data Virtualization Reference Architectures: Correctly Architecting your Solu...
Data Virtualization Reference Architectures: Correctly Architecting your Solu...
 
Introduction to sentry
Introduction to sentryIntroduction to sentry
Introduction to sentry
 
Supporting Data Services Marketplace using Data Virtualization
Supporting Data Services Marketplace using Data VirtualizationSupporting Data Services Marketplace using Data Virtualization
Supporting Data Services Marketplace using Data Virtualization
 
Apache Sentry for Hadoop security
Apache Sentry for Hadoop securityApache Sentry for Hadoop security
Apache Sentry for Hadoop security
 
What's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and BeyondWhat's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and Beyond
 
Designing an Agile Fast Data Architecture for Big Data Ecosystem using Logica...
Designing an Agile Fast Data Architecture for Big Data Ecosystem using Logica...Designing an Agile Fast Data Architecture for Big Data Ecosystem using Logica...
Designing an Agile Fast Data Architecture for Big Data Ecosystem using Logica...
 
빅데이터 플랫폼 새로운 미래
빅데이터 플랫폼 새로운 미래빅데이터 플랫폼 새로운 미래
빅데이터 플랫폼 새로운 미래
 
Logical Data Warehouse and Data Lakes
Logical Data Warehouse and Data Lakes Logical Data Warehouse and Data Lakes
Logical Data Warehouse and Data Lakes
 
Big Data Industry Insights 2015
Big Data Industry Insights 2015 Big Data Industry Insights 2015
Big Data Industry Insights 2015
 
Big Data Security and Governance
Big Data Security and GovernanceBig Data Security and Governance
Big Data Security and Governance
 
Real-time Analytics in Financial: Use Case, Architecture and Challenges
Real-time Analytics in Financial: Use Case, Architecture and ChallengesReal-time Analytics in Financial: Use Case, Architecture and Challenges
Real-time Analytics in Financial: Use Case, Architecture and Challenges
 
Data Warehouse Modeling
Data Warehouse ModelingData Warehouse Modeling
Data Warehouse Modeling
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 

Similaire à Embeddable data transformation for real time streams

Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015Felicia Haggarty
 
Building production spark streaming applications
Building production spark streaming applicationsBuilding production spark streaming applications
Building production spark streaming applicationsJoey Echeverria
 
Building a system for machine and event-oriented data - Data Day Seattle 2015
Building a system for machine and event-oriented data - Data Day Seattle 2015Building a system for machine and event-oriented data - Data Day Seattle 2015
Building a system for machine and event-oriented data - Data Day Seattle 2015Eric Sammer
 
Building a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaBuilding a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaTreasure Data, Inc.
 
Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer confluent
 
Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Timothy Spann
 
Building a system for machine and event-oriented data - Velocity, Santa Clara...
Building a system for machine and event-oriented data - Velocity, Santa Clara...Building a system for machine and event-oriented data - Velocity, Santa Clara...
Building a system for machine and event-oriented data - Velocity, Santa Clara...Eric Sammer
 
Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016
Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016
Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016cdmaxime
 
Perfect Norikra 2nd Season
Perfect Norikra 2nd SeasonPerfect Norikra 2nd Season
Perfect Norikra 2nd SeasonSATOSHI TAGOMORI
 
What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?What is Apache Kafka and What is an Event Streaming Platform?
What is Apache Kafka and What is an Event Streaming Platform?confluent
 
Wikipedia’s Event Data Platform, Or: JSON Is Okay Too With Andrew Otto | Curr...
Wikipedia’s Event Data Platform, Or: JSON Is Okay Too With Andrew Otto | Curr...Wikipedia’s Event Data Platform, Or: JSON Is Okay Too With Andrew Otto | Curr...
Wikipedia’s Event Data Platform, Or: JSON Is Okay Too With Andrew Otto | Curr...HostedbyConfluent
 
nuclio Overview October 2017
nuclio Overview October 2017nuclio Overview October 2017
nuclio Overview October 2017iguazio
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an examplehadooparchbook
 
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...
Simplifying Event Streaming: Tools for Location Transparency and Data Evoluti...confluent
 
iguazio - nuclio overview to CNCF (Sep 25th 2017)
iguazio - nuclio overview to CNCF (Sep 25th 2017)iguazio - nuclio overview to CNCF (Sep 25th 2017)
iguazio - nuclio overview to CNCF (Sep 25th 2017)Eran Duchan
 
Learning the basics of Apache NiFi for iot OSS Europe 2020
Learning the basics of Apache NiFi for iot OSS Europe 2020Learning the basics of Apache NiFi for iot OSS Europe 2020
Learning the basics of Apache NiFi for iot OSS Europe 2020Timothy Spann
 
[WSO2Con EU 2018] The Rise of Streaming SQL
[WSO2Con EU 2018] The Rise of Streaming SQL[WSO2Con EU 2018] The Rise of Streaming SQL
[WSO2Con EU 2018] The Rise of Streaming SQLWSO2
 

Similaire à Embeddable data transformation for real time streams (20)

Streaming ETL for All
Streaming ETL for AllStreaming ETL for All
Streaming ETL for All
 
Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015Building a system for machine and event-oriented data - SF HUG Nov 2015
Building a system for machine and event-oriented data - SF HUG Nov 2015
 
Building production spark streaming applications
Building production spark streaming applicationsBuilding production spark streaming applications
Building production spark streaming applications
 
Spark+flume seattle
Spark+flume seattleSpark+flume seattle
Spark+flume seattle
 
Building a system for machine and event-oriented data - Data Day Seattle 2015
Building a system for machine and event-oriented data - Data Day Seattle 2015Building a system for machine and event-oriented data - Data Day Seattle 2015
Building a system for machine and event-oriented data - Data Day Seattle 2015
 
Building a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaBuilding a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with Rocana
 
Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer Building an Event-oriented Data Platform with Kafka, Eric Sammer
Building an Event-oriented Data Platform with Kafka, Eric Sammer
 
Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4
 
Building a system for machine and event-oriented data - Velocity, Santa Clara...
Building a system for machine and event-oriented data - Velocity, Santa Clara...Building a system for machine and event-oriented data - Velocity, Santa Clara...
Building a system for machine and event-oriented data - Velocity, Santa Clara...
 
Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016
Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016Rocana Deep Dive OC Big Data Meetup #19 Sept 21st 2016
Embeddable data transformation for real time streams

  • 1. Embeddable data transformation for real-time streams. Joey Echeverria, Platform Technical Lead. Strata+Hadoop World, March 31st 2016, San Jose, CA. © Rocana, Inc. All Rights Reserved.
  • 2. Slides: http://j.mp/rocana-transform-slides
  • 3. Questions: http://j.mp/hw-questions
  • 4. Context
  • 5. Joey
      • Where I work: Rocana – Platform Technical Lead
      • Where I used to work: Cloudera (’11-’15), NSA
      • Distributed systems, security, data processing, big data
  • 6. Signing today at 1pm at the Cloudera booth
  • 7. History
  • 8. “Legacy” data architecture (diagram): Data Producers, Flume/Sqoop, HDFS (Avro/Parquet Files), MapReduce, Spark, Impala, Visualization/Query
  • 9. Stream data architecture (diagram): Data Producers, Kafka (Avro Serialized Records), Storm, Flink, Spark Streaming, Real-time Visualization, Kafka Consumers, HDFS (Avro/Parquet Files)
  • 10. Stream data architecture (diagram): Data Producers, Kafka (Avro Serialized Records), Storm, Flink, Spark Streaming, Real-time Visualization, Kafka Consumers, HDFS (Avro/Parquet Files)
  • 11. Stream processing: A primer
  • 12. Stream processing: Filter, Extract, Project, Aggregate, Join, Model
  • 13. Stream processing: Filter, Extract, Project, Aggregate, Join, Model
  • 14. Stream processing: Filter, Extract, Project, Aggregate, Join, Model, Data transformation
  • 15. Apache Storm
      • "Distributed real-time computation system"
      • Applications packaged into topologies (think MapReduce job)
      • Topologies operate over streams of tuples
      • Spout: source of a stream
      • Bolt: arbitrary operation such as filtering, aggregating, joining, or executing arbitrary functions
  • 16. Apache Spark
      • Supports batch and stream processing
      • Continuous stream of records discretized into a DStream
      • DStream: a sequence of RDDs (batches of records)
      • Micro-batch
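The micro-batch idea on the Apache Spark slide can be illustrated with a small Python sketch. It does not use the Spark API. Spark Streaming cuts batches by time interval, so the count-based `batch_size` parameter and the `micro_batches` helper are assumptions made only to keep the example deterministic.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(stream: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Discretize a (possibly unbounded) stream into consecutive small batches.

    Hypothetical helper: a count-based stand-in for a time-based batch interval.
    """
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream is exhausted
        yield batch


# Each yielded list plays the role of one batch of records in the sequence.
batches = list(micro_batches(range(7), 3))
```

Each batch is then processed as a unit, in contrast to the record-at-a-time model.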
  • 17. Apache Flink
      • Supports batch and stream processing
      • DataStream: unbounded collection of records
      • Operations can apply to individual records or windows of records
      • Supports record-at-a-time processing (like Storm)
  • 18. Apache Kafka
      • Pub-sub messaging system implemented as a distributed commit log
      • Popular as a source and sink for data streams
      • Scalability, durability, and easy-to-understand delivery guarantees
      • Can do stream processing directly in Kafka consumers
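A rough Python sketch of the commit-log idea from the Apache Kafka slide, assuming a single in-memory partition. `CommitLog` and `Consumer` are hypothetical names for illustration and are not part of any Kafka client library.

```python
from typing import List


class CommitLog:
    """Append-only list of records; a record's position is its offset."""

    def __init__(self) -> None:
        self._records: List[bytes] = []

    def append(self, record: bytes) -> int:
        self._records.append(record)
        return len(self._records) - 1  # offset of the record just written

    def read(self, offset: int, max_records: int) -> List[bytes]:
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer tracks its own offset, so readers do not affect each other."""

    def __init__(self, log: CommitLog) -> None:
        self.log = log
        self.offset = 0

    def poll(self, max_records: int = 10) -> List[bytes]:
        records = self.log.read(self.offset, max_records)
        self.offset += len(records)
        return records


log = CommitLog()
for payload in (b"a", b"b", b"c"):
    log.append(payload)

fast, slow = Consumer(log), Consumer(log)
first_poll = fast.poll()        # reads everything written so far
slow_poll = slow.poll(1)        # reads only the first record
```

Because records stay in the log after being read, any number of consumers can apply their own stream processing to the same data.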
  • 19. Data transformation
  • 20. Filter (diagram)
  • 21. Extract
      Log line:
        127.0.0.1 Mozilla/5.0 laura [31/Mar/2016] "GET /index.html HTTP/1.0" 200 2326
      Before:
        ts: 1436576671000
        body: <binary blob>
        event_type_id: 100
        ...
      After extract:
        ts: 1436576671000
        body: <binary blob>
        event_type_id: 100
        attributes: {
          ip: "127.0.0.1"
          user_agent: "Mozilla/5.0"
          user_id: "laura"
          date: "[31/Mar/2016]"
          request: "GET /index.html HTTP/1.0"
          status_code: "200"
          size: "2326"
        }
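The extract step on this slide can be sketched in Python with a regular expression. The pattern below is an assumption written to fit the one sample line shown on the slide, and the `extract` function is hypothetical. It also assumes the log line is carried in the event body as UTF-8 bytes.

```python
import re
from typing import Any, Dict

# Assumed layout: ip, user agent, user id, [date], "request", status, size.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) (?P<user_agent>\S+) (?P<user_id>\S+) '
    r'(?P<date>\[[^\]]+\]) "(?P<request>[^"]*)" '
    r'(?P<status_code>\d{3}) (?P<size>\d+)$'
)


def extract(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the event with string attributes parsed from its body."""
    result = dict(event)
    match = LOG_PATTERN.match(event["body"].decode("utf-8"))
    if match is not None:
        result["attributes"] = match.groupdict()  # every value stays a string
    return result


sample = {
    "ts": 1436576671000,
    "event_type_id": 100,
    "body": b'127.0.0.1 Mozilla/5.0 laura [31/Mar/2016] '
            b'"GET /index.html HTTP/1.0" 200 2326',
}
extracted = extract(sample)
```

As on the slide, extraction leaves every attribute as a string. Type conversion is left to a later step.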
  • 22. Project
      Before:
        ts: 1436576671000
        body: <binary blob>
        event_type_id: 100
        attributes: {
          ip: "127.0.0.1"
          user_agent: "Mozilla/5.0"
          user_id: "laura"
          date: "[31/Mar/2016]"
          request: "GET /index.html HTTP/1.0"
          status_code: "200"
          size: "2326"
        }
      After project:
        ts: 1459444413000
        ip: "127.0.0.1"
        user_agent: "Mozilla/5.0"
        user_id: "laura"
        request: "GET /index.html HTTP/1.0"
        status_code: 200
        size: 2326
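A minimal Python sketch of the project step, assuming the event already carries the extracted attributes. The `project` function is hypothetical. It copies `ts` unchanged, whereas the slide's output shows a different `ts` value that this sketch does not try to derive.

```python
from typing import Any, Dict


def project(event: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten selected attributes to top-level fields and convert numeric ones."""
    attrs = event["attributes"]
    return {
        "ts": event["ts"],  # assumption: carried over as-is
        "ip": attrs["ip"],
        "user_agent": attrs["user_agent"],
        "user_id": attrs["user_id"],
        "request": attrs["request"],
        "status_code": int(attrs["status_code"]),  # "200" becomes 200
        "size": int(attrs["size"]),                # "2326" becomes 2326
    }


projected = project({
    "ts": 1436576671000,
    "event_type_id": 100,
    "attributes": {
        "ip": "127.0.0.1",
        "user_agent": "Mozilla/5.0",
        "user_id": "laura",
        "date": "[31/Mar/2016]",
        "request": "GET /index.html HTTP/1.0",
        "status_code": "200",
        "size": "2326",
    },
})
```

The projected record drops `body`, `event_type_id`, and `date`, which matches the set of fields shown in the slide's output.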
  • 23. Problem
  • 24. Who
      • Developers
      • Data engineers
      • Sysadmins
      • Analysts
  • 25. Tools
  • 26. The dark art of data science
      • Feature engineering
      • “Getting a mess of raw data that can be used as input to a machine learning algorithm” - @josh_wills
      • Video from Midwest.io 2014
  • 27. Data transformation for all
  • 28. Rocana Transform
      • Library
      • Java
      • Rocana configuration
      • JSON plus comments and specific numeric types, minus excess quoting
  • 29. Data model
      • Event schema
          • id: A globally unique identifier for this event
          • ts: Epoch timestamp in milliseconds
          • event_type_id: ID indicating the type of the event
          • location: Location from which the event was generated
          • host: Hostname, IP, or other device identifier from which the event was generated
          • service: Service or process from which the event was generated
          • body: Raw event content in bytes
          • attributes: Event type-specific key/value pairs
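The event schema above could be modeled as a Python dataclass. The field names come from the slide. The Python types and the defaults for `body` and `attributes` are assumptions, since the actual schema is defined in Avro and is not shown in the deck.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    id: str                 # globally unique identifier for this event
    ts: int                 # epoch timestamp in milliseconds
    event_type_id: int      # ID indicating the type of the event
    location: str           # where the event was generated
    host: str               # hostname, IP, or other device identifier
    service: str            # service or process that generated the event
    body: bytes = b""       # raw event content
    attributes: Dict[str, str] = field(default_factory=dict)  # type-specific pairs


event = Event(
    id="example-id",        # placeholder value, not a real identifier
    ts=1436576671000,
    event_type_id=100,
    location="aws/us-west-2a",
    host="example01.rocana.com",
    service="dhclient",
    attributes={"syslog_pid": "865"},
)
```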
  • 30. Example event
      {
        "id": "JRHAIDMLCKLEAPMIQDHFLO3MXYXV7NVBEJNDKZGS2XVSEINGGBHA====",
        "event_type_id": 100,
        "ts": 1436576671000,
        "location": "aws/us-west-2a",
        "host": "example01.rocana.com",
        "service": "dhclient",
        "body": "<36>Jul 10 18:04:31 gs09.example.com dhclient[865] DHCPACK from …",
        "attributes": {
          "syslog_timestamp": "1436576671000",
          "syslog_process": "dhclient",
          "syslog_pid": "865",
          "syslog_facility": "3",
          "syslog_severity": "6",
          "syslog_hostname": "example01",
          "syslog_message": "DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)"
        }
      }
  • 31. Filter, extract, and flatten
  • 32. Filter, extract, and flatten
      • Filter out events without type id 100
      • Filter out events without hostname prefix "ex"
      • Extract a numeric prefix from the syslog message
      • Flatten syslog attributes to top-level fields in a different Avro schema
  • 33. Filter, extract, and flatten
      {
        load-event: {},
        // Filter by event_type_id
        filter: { expression: "${event_type_id == 100}" },
        // Extract hostname prefix
        regex: { ... },
        filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },
        // Extract a numeric prefix from the syslog message
        regex: { ... },
        // Build flattened record
        build-avro-record: { ... },
        // Accumulate output record
        accumulate-output: { value: "${output_record}" }
      }
  • 34. Extract hostname prefix
      {
        load-event: {},
        filter: { expression: "${event_type_id == 100}" },
        regex: {
          pattern: "^(.{2}).*$",
          value: "${attr.syslog_hostname}",
          destination: "host_prefix"
        },
        filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },
        ...
      }
  • 35. Extract numeric prefix
      ...
      filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },
      regex: {
        pattern: "^([0-9]*)",
        value: "${attributes['syslog_message']}",
        destination: "msg",
        match-actions: {
          set-values: { extracted_field: "${msg.match.group.1}" }
        },
        no-match-actions: {
          set-values: { extracted_field: "" }
        }
      },
      ...
  • 36. Build flattened record
      ...
      build-avro-record: {
        schema-uri: "resource:avro-schemas/flattened-syslog.avsc",
        destination: "output_record",
        field-mapping: {
          ts: "${ts}",
          event_type_id: "${event_type_id}",
          source: "${source}",
          syslog_facility: "${convert:toInt(attributes['syslog_facility'])}",
          syslog_severity: "${convert:toInt(attributes['syslog_severity'])}",
          ...
          syslog_message: "${attributes['syslog_message']}",
          syslog_pid: "${convert:toInt(attributes['syslog_pid'])}",
          extracted_field: "${extracted_field}"
        },
      },
      ...
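For readers who want to check the logic of the filter, extract, and flatten configuration, here is a plain Python sketch of the same steps. It is not Rocana Transform. The `transform` function is hypothetical, it returns `None` where the configuration would filter an event out, and it omits the fields elided with "..." on the slide as well as `source`, which the example event does not contain.

```python
import re
from typing import Any, Dict, Optional

HOST_PREFIX = re.compile(r"^(.{2}).*$")   # same pattern as the hostname regex action
NUMERIC_PREFIX = re.compile(r"^([0-9]*)")  # same pattern as the message regex action


def transform(event: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    # Filter out events without type id 100.
    if event["event_type_id"] != 100:
        return None
    attrs = event["attributes"]
    # Filter out events whose syslog hostname does not start with "ex".
    prefix = HOST_PREFIX.match(attrs.get("syslog_hostname", ""))
    if prefix is None or prefix.group(1) != "ex":
        return None
    # Extract a numeric prefix from the syslog message (empty string if none).
    number = NUMERIC_PREFIX.match(attrs["syslog_message"])
    extracted = number.group(1) if number else ""
    # Flatten selected attributes to top-level fields with integer conversion.
    return {
        "ts": event["ts"],
        "event_type_id": event["event_type_id"],
        "syslog_facility": int(attrs["syslog_facility"]),
        "syslog_severity": int(attrs["syslog_severity"]),
        "syslog_message": attrs["syslog_message"],
        "syslog_pid": int(attrs["syslog_pid"]),
        "extracted_field": extracted,
    }


example = {
    "event_type_id": 100,
    "ts": 1436576671000,
    "attributes": {
        "syslog_pid": "865",
        "syslog_facility": "3",
        "syslog_severity": "6",
        "syslog_hostname": "example01",
        "syslog_message": "DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)",
    },
}
flattened = transform(example)
```

The example event's message has no leading digits, so `extracted_field` comes out empty, the same value the `no-match-actions` branch sets.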
  • 37. Extract metrics from log data
  • 38. Extract metrics
      • Input: HTTP status logs
      • Extract request latency
      • Extract counts by HTTP status code
      • Metric types
          • Gauge: A value that varies over time (think latency, CPU %, etc.)
          • Counter: A value that accumulates over time (think event volume, status codes, etc.)
  • 39. Example metric event
      {
        "id": "JRHAIDMLCKLEAPMIQDHFLO3MXBBQ7NVBEJNDKZGS2XVSEINGGBHA====",
        "event_type_id": 107,
        "ts": 1436576671000,
        "location": "aws/us-west-2a",
        "host": "web01.rocana.com",
        "service": "httpd",
        "attributes": {
          "m.http.request.latency": "4.2000000000E1|g",
          "m.http.status.401.count": "1.0000000000E0|c"
        }
      }
  • 40. Extract metrics
      {
        load-event: {},
        build-metric: {
          gauge-mapping: {
            http.request.latency: "${convert:toDouble(attributes['latency'])}"
          },
          destination: "latency_metric"
        },
        accumulate-output: { value: "${latency_metric}" },
        build-metric: {
          dynamic-counter-mapping: [
            "${string:format('http.status.%s.count', attributes['sc_status'])}", 1D
          ],
          destination: "status_metric"
        },
        accumulate-output: { value: "${status_metric}" }
      }
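The attribute values in the example metric event look like a number in scientific notation followed by a type suffix. The Python sketch below reproduces that shape under several assumptions read off the slides rather than taken from any documentation: `|g` marks a gauge, `|c` marks a counter, and metric names are stored under an `m.` prefix. `encode_metric` and `build_metrics` are hypothetical helpers, and unlike the configuration this sketch returns both metrics in one dictionary.

```python
from typing import Dict


def encode_metric(value: float, kind: str) -> str:
    """Format a value like '4.2000000000E1|g' (10 fractional digits, bare exponent)."""
    mantissa, exponent = f"{value:.10E}".split("E")  # e.g. '4.2000000000', '+01'
    return f"{mantissa}E{int(exponent)}|{kind}"


def build_metrics(attributes: Dict[str, str]) -> Dict[str, str]:
    """Build a latency gauge and a per-status counter from parsed log attributes."""
    latency = float(attributes["latency"])
    status = attributes["sc_status"]
    return {
        "m.http.request.latency": encode_metric(latency, "g"),
        f"m.http.status.{status}.count": encode_metric(1.0, "c"),
    }


metrics = build_metrics({"latency": "42", "sc_status": "401"})
```

The counter name is built from the status code at run time, which mirrors what `dynamic-counter-mapping` does with `string:format`.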
  • 41. Architecture
  • 42. Architecture (diagram): Configuration file, Driver, Java action objects, Context (Variables)
      1. Parse config
      2. Initialize context
      3. Execute actions
      4. Read/write variables
      5. Copy output
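The five steps on the architecture slide can be sketched as a small Python driver loop. This is an illustration under assumptions, not the library's design: the configuration is modeled as a list of (keyword, parameters) pairs, the two builders are simplified stand-ins for the real actions, and an action returning `False` is the assumed way to stop processing an event.

```python
from typing import Any, Callable, Dict, List, Tuple

Context = Dict[str, Any]
Action = Callable[[Context], bool]  # return False to stop processing this event


def build_filter(params: Dict[str, Any]) -> Action:
    name, expected = params["field"], params["equals"]
    return lambda ctx: ctx["variables"].get(name) == expected


def build_accumulate_output(params: Dict[str, Any]) -> Action:
    name = params["value"]

    def run(ctx: Context) -> bool:
        ctx["output"].append(ctx["variables"][name])
        return True

    return run


BUILDERS = {"filter": build_filter, "accumulate-output": build_accumulate_output}


def parse_config(config: List[Tuple[str, Dict[str, Any]]]) -> List[Action]:
    """Step 1: turn configuration entries into action objects."""
    return [BUILDERS[keyword](params) for keyword, params in config]


def process(actions: List[Action], event: Dict[str, Any]) -> List[Any]:
    ctx: Context = {"variables": dict(event), "output": []}  # step 2
    for action in actions:                                   # step 3
        if not action(ctx):                                  # step 4 happens inside
            break
    return list(ctx["output"])                               # step 5


actions = parse_config([
    ("filter", {"field": "event_type_id", "equals": 100}),
    ("accumulate-output", {"value": "host"}),
])
kept = process(actions, {"event_type_id": 100, "host": "example01"})
dropped = process(actions, {"event_type_id": 107, "host": "web01"})
```

Parsing the configuration once and reusing the action list for every event keeps per-event work limited to steps 2 through 5.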
  • 43. Custom actions
      • Actions loaded at runtime using Java services framework
      • Add your jar to the classpath
      • Custom actions appear as top-level keywords just like regular actions
      • Implement the execute() method of the Action interface
      • Implement the build() method of the ActionBuilder interface
  • 44. Custom actions
      • Parse custom log formats
          • Cisco ACS
          • Citrix
          • Juniper
          • Customer-specific formats
      • Lookup IP addresses in the MaxMind GeoIP2 database
      • Reference dataset lookups
          • Device id to device name
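The Action and ActionBuilder split can be sketched in Python for a device id to device name lookup. The slides name only the `execute()` and `build()` methods, so the argument lists, the `keyword` attribute, and the `REGISTRY` dictionary (standing in for runtime discovery through the Java services framework) are all assumptions.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class Action(ABC):
    @abstractmethod
    def execute(self, context: Dict[str, Any]) -> None:
        """Read and write variables in the context."""


class ActionBuilder(ABC):
    keyword: str  # assumed: the top-level keyword used in the configuration

    @abstractmethod
    def build(self, config: Dict[str, Any]) -> Action:
        """Create a configured action from its configuration block."""


class DeviceNameLookup(Action):
    def __init__(self, table: Dict[str, str], source: str, destination: str) -> None:
        self.table, self.source, self.destination = table, source, destination

    def execute(self, context: Dict[str, Any]) -> None:
        device_id = context.get(self.source)
        context[self.destination] = self.table.get(device_id, "unknown")


class DeviceNameLookupBuilder(ActionBuilder):
    keyword = "device-name-lookup"  # hypothetical keyword

    def build(self, config: Dict[str, Any]) -> Action:
        return DeviceNameLookup(config["table"], config["source"], config["destination"])


# Stand-in for service discovery: map each keyword to its builder.
REGISTRY = {builder.keyword: builder for builder in (DeviceNameLookupBuilder(),)}

action = REGISTRY["device-name-lookup"].build({
    "table": {"d-17": "core-router"},
    "source": "device_id",
    "destination": "device_name",
})
context = {"device_id": "d-17"}
action.execute(context)
```

Keeping construction in the builder means configuration errors can surface when the configuration is parsed rather than when the first event arrives.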
  • 45. Putting it all together
      • Stream processing is causing us to re-think how we analyze data
      • Limiting access to the data transformation side increases costs and decreases velocity
      • Reduce your reliance on developers to code custom pipelines
      • Re-use transformation configuration in any stream processing framework or batch job
  • 46. Coming soon
      • Rocana Transform will be released under the ASL 2.0
      • The base configuration library is available today: https://github.com/scalingdata/rocana-configuration
  • 47. Questions?
      • Signing "Hadoop Security" today at 1pm at the Cloudera booth