SlideShare une entreprise Scribd logo
1  sur  101
Télécharger pour lire hors ligne
© 2017 MapR Technologies
Data Pipeline Using Apache APIs: Kafka,
Spark, and MapR-DB
© 2017 MapR Technologies
Data Pipeline Using Apache APIs: Kafka, Spark, and MapR-DB
•  Kafka
•  Spark Streaming
•  Spark SQL
© 2017 MapR Technologies
Streaming ETL Pipeline
Data Collect Process Store
Stream
Topic
Spark
Streaming
Kafka API
SQL
Open
JSON
API
Analyze
JSON
SQL
© 2017 MapR Technologies
Traditional ETL
Image Reference: Databricks
© 2017 MapR Technologies
Streaming ETL
Image Reference: Databricks
© 2017 MapR Technologies
What is a Stream ?
Producers Consumers
•  A stream is an continuous sequence of events or records
•  Records are key-value pairs
Stream of Data
key value key value key value key value
© 2017 MapR Technologies
Examples of Streaming Data
Fraud detection Smart Machinery Smart Meters Home Automation
Networks Manufacturing Security Systems Patient Monitoring
© 2017 MapR Technologies
Examples of Streaming Data
•  Monitoring devices combined with ML can provide alerts for Sepsis,
which is one of the leading causes for death in hospitals
–  http://www.computerweekly.com/news/450422258/Putting-sepsis-algorithms-into-electronic-
patient-records
© 2017 MapR Technologies
Examples of Streaming Data
•  A Stanford team has shown that a machine-learning model can identify heart
arrhythmias from an electrocardiogram (ECG) better than an expert
–  https://www.technologyreview.com/s/608234/the-machines-are-getting-ready-to-play-doctor/
© 2017 MapR Technologies
Applying Machine Learning to Live Patient Data
•  https://www.slideshare.net/caroljmcdonald/applying-machine-learning-to-
live-patient-data
© 2017 MapR Technologies
What has changed in the past 10 years?
•  Distributed computing
•  Streaming analytics
•  Improved machine learning
© 2017 MapR Technologies
Serve DataStore DataCollect Data
What Do We Need to Do ?
Process DataData Sources
? ? ? ?
© 2017 MapR Technologies
Collect the Data
Data IngestSource
Stream
Topic
•  Data Ingest:
–  Using the Kafka API
© 2017 MapR Technologies
Organize Data into Topics with MapR-Event Streams
Topics: Logical collection of events, Organize Events into Categories
Consumers
MapR Cluster
Topic: Pressure
Topic: Temperature
Topic: Warnings
Consumers
Consumers
Kafka API Kafka API
© 2017 MapR Technologies
Scalable Messaging with MapR Event Streams
Server 1
Partition1: Topic - Pressure
Partition1: Topic - Temperature
Partition1: Topic - Warning
Server 2
Partition2: Topic - Pressure
Partition2: Topic - Temperature
Partition2: Topic - Warning
Server 3
Partition3: Topic - Pressure
Partition3: Topic - Temperature
Partition3: Topic - Warning
Topics are
partitioned for
throughput and
scalability
© 2017 MapR Technologies
Scalable Messaging with MapR Event Streams
Partition1: Topic - Pressure
Partition1: Topic - Temperature
Partition1: Topic - Warning
Partition2: Topic - Pressure
Partition2: Topic - Temperature
Partition2: Topic - Warning
Partition3: Topic - Pressure
Partition3: Topic - Temperature
Partition3: Topic - Warning
Producers are load
balanced between
partitions
Kafka API
© 2017 MapR Technologies
Scalable Messaging with MapR Event Streams
Partition1: Topic - Pressure
Partition1: Topic - Temperature
Partition1: Topic - Warning
Partition2: Topic - Pressure
Partition2: Topic - Temperature
Partition2: Topic - Warning
Partition3: Topic - Pressure
Partition3: Topic - Temperature
Partition3: Topic - Warning
Consumers
Consumers
Consumers
Consumer
groups can
read in
parallel
Kafka API
© 2017 MapR Technologies
Partition is like an Event Log
Consumers
MapR Cluster
Topic: Admission / Server 1
Topic: Admission / Server 2
Topic: Admission / Server 3
Consumers
Consumers
Partition
1
New Messages are
appended to the end
Partition
2
Partition
3
6 5 4 3 2 1
3 2 1
5 4 3 2 1
Producers
Producers
Producers
New
Message
6 5 4 3 2 1
Old
Message
© 2017 MapR Technologies
Partition is like a Queue
Messages are delivered in the order they are received
MapR Cluster
6 5 4 3 2 1
Consumer
groupProducers
Read cursors
Consumer
group
© 2017 MapR Technologies
Unlike a queue, events are still persisted after they’re delivered
Messages remain on the partition, available to other consumers
MapR Cluster (1 Server)
Topic: Warning
Partition
1
3 2 1 Unread Events
Get Unread
3 2 1
Client Library ConsumerPoll
© 2017 MapR Technologies
When Are Messages Deleted?
•  Messages can be persisted forever
•  Or
•  Older messages can be deleted automatically based on time to live
MapR Cluster (1 Server)
6 5 4 3 2 1Partition
1
Older
message
© 2017 MapR Technologies
Traditional Message queue
© 2017 MapR Technologies
How do we do this with High Performance at Scale?
•  Parallel operations
•  minimizes disk read/writes
© 2017 MapR Technologies
Processing Same Message for Different Purposes
Consumers
Consumers
Consumers
Producers
Producers
Producers
MapR-FS
Kafka API Kafka API
© 2017 MapR Technologies
Stream as the System of Record
© 2017 MapR Technologies
A Table is a Snapshot of a Stream
Updates
Imagine each event as a change to an entry in a database.
Account Id Balance
WillO 80.00
BradA 20.00
1: WillO : Deposit : 100.00
2: BradA : Deposit : 50.00
3: BradA : Withdraw : 30.00
4: WillO : Withdraw: 20.00
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Change log
4 3 2 1
© 2017 MapR Technologies
A Stream is a Change Log of a Table
Change Log
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
3 2 1 3 2 1
3 2 1
Duality of Streams and Tables
Master:
Append writes
Slave:
Apply writes in order
Replication of changes
© 2017 MapR Technologies
Rewind: Reprocessing Events
MapR Cluster
6 5 4 3 2 1Producers
Reprocess from
oldest
Consumer
Create new view, Index, cache
© 2017 MapR Technologies
Rewind Reprocessing Events
MapR Cluster
6 5 4 3 2 1Producers
To Newest
Consumer
new view
Read from
new view
© 2017 MapR Technologies
Event Sourcing, Command Query Responsibility Separation:
Turning the Database Upside Down
Key-Val Document Graph
Wide
Column
Time
Series
Relational
???Events Updates
© 2017 MapR Technologies
Use Case: Streaming System of Record for Healthcare
Objective:
•  Build a flexible, secure
healthcare exchange
Records Analysis
Applications
Challenges:
•  Many different data models
•  Security and privacy issues
•  HIPAA compliance
Records
© 2017 MapR Technologies32
ALLOY Health:
Exchange State HIE
Clinical Data Viewer
Reporting and Analytics
Clinical Data
Financial Data
Provider
Organizations
What are the outcomes in
the entire state on
diabetes?
Are there doctors that are
doing this better than
others?
Georgia Health Connect
© 2017 MapR Technologies
Use Case: Streaming System of Record for Healthcare
© 2017 MapR Technologies
Spark Dataset
© 2017 MapR Technologies
Spark Distributed Datasets
Dataset
W
Executor
P4
W
Executor
P1 P3
W
Executor
P2
partitioned
Partition 1
8213034705, 95,
2.927373,
jake7870, 0……
Partition 2
8213034705,
115, 2.943484,
Davidbresler2,
1….
Partition 3
8213034705,
100, 2.951285,
gladimacowgirl,
58…
Partition 4
8213034705,
117, 2.998947,
daysrus, 95….
•  Read only collection of typed objects
Dataset[T]
•  Partitioned across a cluster
•  Operated on in parallel
•  can be Cached
© 2017 MapR Technologies
val df: Dataset[Payment] = spark.read.json(”/p/file.json").as[Payment]
Spark Distributed Datasets read from a file
Worker
Task
Worker
Block 1
Block 2
Block 3
Driver
Cache 1
Cache 2
Cache 3
Process
& Cache
Data
Process
& Cache
Data
Process
& Cache
Data
Task
Task
Block 1
Driver
tasks
tasks
tasks
© 2017 MapR Technologies
DataFrame is like a table
Dataset[Row]
row
columns
DataFrame = Dataset[Row] can use Spark SQL
© 2017 MapR Technologies
A Dataset is a collection of typed objects
Dataset[objects]
objects
columns
Dataset[Typed Object] can use Spark SQL and Functions
© 2017 MapR Technologies
Spark Streaming
© 2017 MapR Technologies
Collect Data
Process the Data with Spark Streaming
Process Data
Stream
Topic
•  scalable, high-throughput, stream
processing of live data
© 2017 MapR Technologies
What is a Stream ?
Producers Consumers
•  A stream is an continuous sequence of events or records
•  Records are key-value pairs
Stream of Data
key value key value key value key value
© 2017 MapR Technologies
Data stream Unbounded Table
new data in the
data stream
=
new rows appended
to an unbounded table
Data stream as an unbounded table
Treat Stream as Unbounded Tables
© 2017 MapR Technologies
Spark Distributed Datasets read from Stream partitions
Task
Cache
Process
& Cache
Data
offsets
Stream
partition
Task
Cache
Process
& Cache
Data
Task
Cache
Process
& Cache
Data
Driver
Stream
partition
Stream
partition
Data is cached for
aggregations
And windowed
functions
© 2017 MapR Technologies
Streaming data =
Unbounded table
Static Data =
bounded table
Same Dataset operations & SQL
Stream Processing on Spark SQL Engine
© 2017 MapR Technologies
Conceptual model
incremental
query
3.  Append
© 2017 MapR Technologies
Continuous incremental execution
Spark SQL converts queries
to incremental execution plans
For input of data
Incremental
Incremental
Incremental
© 2017 MapR Technologies
Use Case: Payment Data
Payment input data
Stream
Input
"NEW","Covered Recipient Physician",,,,"132655","GREGG","D","ALZATE",,"8745
AERO DRIVE","STE 200","SAN DIEGO","CA","92123","United States",,,"Medical
Doctor","Allopathic & Osteopathic Physicians|Radiology|Diagnostic
Radiology","CA",,,,,"DFINE, Inc","100000000326","DFINE, Inc","CA","United States",
90.87,"02/12/2016","1","In-kind items and services","Food and Beverage",,,,"No","No
Third Party
Payment",,,,,"No","346039438","No","Yes","Covered","Device","Radiology","StabiliT",
,"Covered","Device","Radiology","STAR Tumor Ablation
System",,,,,,,,,,,,,,,,,"2016","06/30/2017"
transform
Spark
Streaming
{
"_id":"317150_08/26/2016_346122858",
"physician_id":"317150",
"date_payment":"08/26/2016",
"record_id":"346122858",
"payer":"Mission Pharmacal Company",
"amount":9.23,
"Physician_Specialty":"Obstetrics & Gynecology",
"Nature_of_payment":"Food and Beverage"
}
JSON
© 2017 MapR Technologies
Use Case: Open Payment Dataset
•  Payments Drug and Device companies make to
•  Physicians and Teaching Hospitals for
•  Travel, Research, Gifts, Speaking fees, and Meals
© 2017 MapR Technologies
Scenario: Payment Data
Provider
ID
Date Payer Payer
State
Provider
Specialty
Provider
State
Amount Payment
Nature
1261770 01/11/2016
Southern
Anesthesia
& Surgical,
Inc
CO
Oral and
Maxillofacial
Surgery
CA 117.5
Food and
Beverage
© 2017 MapR Technologies
Stream the data into a Dataframe: Define the Schema
case class Payment(physician_id: String,
date_payment: String, payer: String, payer_state: String
amount: Double, physician_specialty: String,
phys_state: String, nature_of_payment:String)
val schema = StructType(Array(
StructField("_id", StringType, true),
StructField("physician_id", StringType, true),
StructField("date_payment", StringType, true),
StructField("payer", StringType, true),
StructField("payer_state", StringType, true),
StructField("amount", DoubleType, true),
StructField("physician_specialty", StringType, true),
StructField("physician_type", StringType, true),
StructField("physician_state", StringType, true),
StructField("nature_of_payment", StringType, true)
))
© 2017 MapR Technologies
Function to Parse CSV into Payment Class
def parse(str: String): Payment = {
val td = str.split(",(?=([^"]*"[^"]*")*[^"]*$)")
val physician_id = td(5)
val payer = td(27)
. . .
val physician_state = td(20)
var focus =td(19)
val id =physician_state+'_’+focus+ '_’+ date_payment+'_'
+ record_id
Payment(id, physician_id, date_payment, payer, payer_state,
amount, physician_type, focus, physician_state,
nature_of_payment)
		}
© 2017 MapR Technologies
Parsed and Transformed Payment Data
{
"_id":”TX_Gynecology_08/26/2016_346122858",
"physician_id":"317150",
"date_payment":"08/26/2016",
"payer":"Mission Pharmacal Company",
"payer_state":”CO",
"amount":9.23,
”physician_specialty":”Gynecology",
“physician_state":”TX"
”nature_of_payment":"Food and Beverage"
}
Example Dataset Row
© 2017 MapR Technologies
Streaming pipeline Data source
Specify data source
returns a dataframe
val	df1	=	spark.readStream.format("kafka")		
	.option("kafka.boostrap.servers",...)		
	.option("subscribe",	"topic")	
	.load()
© 2017 MapR Technologies
Transformation
Cast bytes from Kafka records
to a string, parse csv , and
return Dataset[Payment]
spark.udf.register("deserialize",
(message: String) => parse(message))
val df2=df1
.selectExpr("""deserialize(CAST(value as STRING))
AS message""")
.select($"message".as[Payment])
© 2017 MapR Technologies
Streams
Stream Processing
Stream
Processing
Storage
Raw
Enriched
Filtered
Stream Processing:
•  Filtering
•  Transformations
•  Aggregations
•  Enrichments with
ML
•  Enrichments with
joins
MapR-DB
MapR-XD
© 2017 MapR Technologies
Dataframe Integrated Queries
L I Query Description
agg(expr, exprs) Aggregates on entire DataFrame
distinct Returns new DataFrame with unique rows
except(other) Returns new DataFrame with rows from this DataFrame not in
other DataFrame
filter(expr);
where(condition)
Filter based on the SQL expression or condition
groupBy(cols:
Columns)
Groups DataFrame using specified columns
join (DataFrame,
joinExpr)
Joins with another DataFrame using given join expression
sort(sortcol) Returns new DataFrame sorted by specified column
select(col) Selects set of columns
© 2017 MapR Technologies
Continuous aggregations
Continuously compute average payment amountval	d3=df2.avg(“amount")
© 2017 MapR Technologies
Continuous aggregations and filter
val	d3=df2.groupBy(“payer")	
					.avg(“amount")
© 2017 MapR Technologies
Continuous aggregations and filter
val	d3=df2	
					.filter($"amount"	>	20000)
© 2017 MapR Technologies
Streaming pipeline Kafka topic Data Sink
Write results to Kafka topic
Start running the query
val	query	=	df3.write	
				.format("kafka")	
				.option("kafka.bootstrap.servers",	"host1:port1,host2:port2")	
				.option("topic",	"/apps/uberstream:uberp")															
	
query.start.awaitTermination()
© 2017 MapR Technologies
Streaming Applicaton
Streaming Dataset
Streaming Dataset
Transformed
Stream
Topic
Stream
Topic
Stream
Topic
Stream
Topic
Stream
Topic
Stream
Topic
© 2017 MapR Technologies
Spark & MapR-DB
© 2017 MapR Technologies
AnalyzeStore DataCollect Data
What Do We Need to Do ?
Process DataData Sources
? ? ? ?
Stream
Topic
Spark
Streaming
JSON
© 2017 MapR Technologies
Stream Processing Pipeline
Data Collect Process Store
Stream
Topic
Spark
Streaming
Kafka API
SQL
Open
JSON
API
Analyze
JSON
SQL
© 2017 MapR Technologies
Where/How to store data ?
© 2017 MapR Technologies
Relational Database vs. MapR-DB
bottleneck
Storage ModelRDBMS MapR-DB
Normalized schema à Joins for
queries can cause bottleneck De-Normalized schema à Data that
is read together is stored together
Key colB colC
xxx val val
xxx val val
Key colB colC
xxx val val
xxx val val
Key colB colC
xxx val val
xxx val val
© 2017 MapR Technologies
MapR-DB JSON Document Store
Data that is read together is stored
together
© 2017 MapR Technologies
Designed for Partitioning and Scaling
Key
Range
xxxx
xxxx
Key
Range
xxxx
xxxx
Key
Range
xxxx
xxxx
Fast Reads and Writes by Key! Data is automatically partitioned
by Key Range!
Key colB colC
xxx val val
xxx val val
Key colB colC
xxx val val
xxx val val
Key colB colC
xxx val val
xxx val val
© 2017 MapR Technologies
MapR-DB JSON Document Store
Data is automatically partitioned
by Key Range!
© 2017 MapR Technologies
Payment Data
Automatically sorted and
partitioned by Key Range
(_id)
{
"_id":”TX_Gynecology_08/26/2016_346122858",
"physician_id":"317150",
"date_payment":"08/26/2016",
"payer":"Mission Pharmacal Company",
"payer_state":”CO",
"amount":9.23,
”physician_specialty":”Gynecology",
“physician_state":”TX"
”nature_of_payment":"Food and Beverage"
}
© 2017 MapR Technologies
Spark Streaming writing to MapR-DB JSON
© 2017 MapR Technologies
Spark MapR-DB Connector
•  Connection object in every Spark Executor:
•  distributed parallel writes & reads
© 2017 MapR Technologies
Use Case: Flight Delays
Payment input data
Stream
Input
"NEW","Covered Recipient Physician",,,,"132655","GREGG","D","ALZATE",,"8745
AERO DRIVE","STE 200","SAN DIEGO","CA","92123","United States",,,"Medical
Doctor","Allopathic & Osteopathic Physicians|Radiology|Diagnostic
Radiology","CA",,,,,"DFINE, Inc","100000000326","DFINE, Inc","CA","United States",
90.87,"02/12/2016","1","In-kind items and services","Food and Beverage",,,,"No","No
Third Party
Payment",,,,,"No","346039438","No","Yes","Covered","Device","Radiology","StabiliT",
,"Covered","Device","Radiology","STAR Tumor Ablation
System",,,,,,,,,,,,,,,,,"2016","06/30/2017"
transform
Spark
Streaming
{
"_id":"317150_08/26/2016_346122858",
"physician_id":"317150",
"date_payment":"08/26/2016",
"record_id":"346122858",
"payer":"Mission Pharmacal Company",
"amount":9.23,
"Physician_Specialty":"Obstetrics & Gynecology",
"Nature_of_payment":"Food and Beverage"
}
JSON
© 2017 MapR Technologies
Streaming pipeline Data Sink
Write result to maprdb
Start running the query
val	query	=	df2.writeStream	
						.format(MapRDBSourceConfig.Format)	
						.option(MapRDBSourceConfig.TablePathOption, "/apps/paytable")	
						.option(MapRDBSourceConfig.CreateTableOption,	false)	
						.option(MapRDBSourceConfig.IdFieldPathOption,	"value")	
						.outputMode(”append")	
																				
query.start().awaitTermination()
© 2017 MapR Technologies
Streaming Applicaton
Streaming Dataset
Streaming Dataset
Transformed
Tablets
Stream
Topic
Stream
Topic
Stream
Topic
Data is rapidly
available for complex,
ad-hoc analytics
© 2017 MapR Technologies
Explore the Data With Spark SQL
© 2017 MapR Technologies
AnalyzeStore DataCollect Data
What Do We Need to Do ?
Process DataData Sources
? ? ?
Stream
Topic
Spark
Streaming
JSON
SQL
© 2017 MapR Technologies
Spark SQL Querying MapR-DB JSON
© 2017 MapR Technologies
Data
Frame
Load data
Load the data into a Dataframe
val pdf: Dataset[Payment] =
spark.loadFromMapRDB[Payment]("/apps/paytable", schema)
.as[Payment]
pdf.select("_id", "payer", "amount”).show
© 2017 MapR Technologies
val pdf: Dataset[Payment] =
spark.loadFromMapRDB[Payment]("/apps/paytable", schema)
.as[Payment]
Spark Distributed Datasets read from MapR-DB Partitions
Worker
Task
Worker
Driver
Cache 1
Cache 2
Cache 3
Process
& Cache
Data
Process
& Cache
Data
Process
& Cache
Data
Task
Task
Driver
tasks
tasks
tasks
© 2017 MapR Technologies
Language Integrated Queries
L I Query Description
agg(expr, exprs) Aggregates on entire DataFrame
distinct Returns new DataFrame with unique rows
except(other) Returns new DataFrame with rows from this DataFrame not in
other DataFrame
filter(expr);
where(condition)
Filter based on the SQL expression or condition
groupBy(cols:
Columns)
Groups DataFrame using specified columns
join (DataFrame,
joinExpr)
Joins with another DataFrame using given join expression
sort(sortcol) Returns new DataFrame sorted by specified column
select(col) Selects set of columns
© 2017 MapR Technologies
Top 5 Nature of Payment by count
val res = pdf.groupBy(”Nature_of_Payment")
.count()
.orderBy(desc(count))
.show(5)
© 2017 MapR Technologies
Top 5 Nature of Payment by amount of payment
%sql select Nature_of_payment,
sum(amount) as total from payments
group by Nature_of_payment order by total desc limit 5
© 2017 MapR Technologies
What are the Nature of Payments with payments > $1000 with count
pdf.filter($"amount" > 1000)
.groupBy("Nature_of_payment")
.count().orderBy(desc("count")).show()
© 2017 MapR Technologies
Top 5 Physician Specialties by total Amount
%sql select physician_specialty, sum(amount) as total
from payments where physician_specialty IS NOT NULL
group by physician_specialty order by total desc limit 5
© 2017 MapR Technologies
Average Payment by Specialty
© 2017 MapR Technologies
Top Payers by Total Amount with count
%sql select payer, payer_state, count(*) as cnt,
sum(amount) as total from payments
group by payer, payer_state order by total desc limit 10
© 2017 MapR Technologies
Stream Processing
Building a Complete Data Architecture
MapR File System
(MapR-XD)
MapR Converged Data Platform
MapR Database
(MapR-DB)
MapR Event Streams
Sources/Apps Bulk Processing
All of these components can run on the same cluster with the MapR Converged platform.
© 2017 MapR Technologies
Data Pipelines and Machine Learning Logistics
Input Data +
Actual Delay
Input Data +
Predictions
Consumer
withML
Model 2
Consumer
withML
Model 1
Decoy
results
Consumer
Consumer
withML
Model 3
Consumer
Stream
Archive
Stream
Scores
Stream
Input
SQL
SQL
Real time
Flight Data
Stream
Input
Actual Delay
Input Data +
Predictions +
Actual Delay
Real Time
dashboard +
Historical
Analysis
© 2017 MapR Technologies
© 2017 MapR Technologies
To Learn More:
•  MapR Free ODT http://learn.mapr.com/
© 2017 MapR Technologies
…helping you put data technology to work
●  Find answers
●  Ask technical questions
●  Join on-demand training course
discussions
●  Follow release announcements
●  Share and vote on product ideas
●  Find Meetup and event listings
Connect with fellow Apache
Hadoop and Spark professionals
community.mapr.com
© 2017 MapR Technologies
MapR Blog
• https://www.mapr.com/blog/
© 2017 MapR Technologies
To Learn More: ETL Payment data pipeline
•  https://mapr.com/blog/etl-pipeline-healthcare-dataset-with-spark-json-mapr-
db/
•  https://mapr.com/blog/streaming-data-pipeline-transform-store-explore-
healthcare-dataset-mapr-db/
© 2017 MapR Technologies
To Learn More:
•  https://mapr.com/blog/how-stream-first-architecture-patterns-are-
revolutionizing-healthcare-platforms/
© 2017 MapR Technologies
To Learn More:
•  https://mapr.com/blog/ml-iot-connected-medical-devices/
© 2017 MapR Technologies
Applying Machine Learning to Live Patient Data
•  https://www.slideshare.net/caroljmcdonald/applying-machine-learning-to-
live-patient-data
© 2017 MapR Technologies
MapR Container for Developers
• https://maprdocs.mapr.com/home/MapRContainerDevelopers/
MapRContainerDevelopersOverview.html
© 2017 MapR Technologies
MapR Data Science Refinery
• https://mapr.com/products/data-science-refinery/
© 2017 MapR Technologies
MapR Data Platform
© 2017 MapR Technologies
Q&A
ENGAGE WITH US

Contenu connexe

Tendances

Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...Carol McDonald
 
Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Carol McDonald
 
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Carol McDonald
 
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Carol McDonald
 
Demystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep LearningDemystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep LearningCarol McDonald
 
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...Carol McDonald
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionMapR Technologies
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataCarol McDonald
 
Introduction to machine learning with GPUs
Introduction to machine learning with GPUsIntroduction to machine learning with GPUs
Introduction to machine learning with GPUsCarol McDonald
 
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...MapR Technologies
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBaseCarol McDonald
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APICarol McDonald
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on HadoopCarol McDonald
 
Free Code Friday - Machine Learning with Apache Spark
Free Code Friday - Machine Learning with Apache SparkFree Code Friday - Machine Learning with Apache Spark
Free Code Friday - Machine Learning with Apache SparkMapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationMapR Technologies
 

Tendances (20)

Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
 
Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures
 
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
Streaming Machine learning Distributed Pipeline for Real-Time Uber Data Using...
 
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
 
Demystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep LearningDemystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep Learning
 
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real- T...
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming Data
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
 
Introduction to machine learning with GPUs
Introduction to machine learning with GPUsIntroduction to machine learning with GPUs
Introduction to machine learning with GPUs
 
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBase
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka API
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 
Big Data Paris
Big Data ParisBig Data Paris
Big Data Paris
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on Hadoop
 
Free Code Friday - Machine Learning with Apache Spark
Free Code Friday - Machine Learning with Apache SparkFree Code Friday - Machine Learning with Apache Spark
Free Code Friday - Machine Learning with Apache Spark
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 

Similaire à Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB

Episode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data ServicesEpisode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data ServicesMesosphere Inc.
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
Dataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice WayDataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice WayJosef Adersberger
 
Introduction to Streaming with Apache Flink
Introduction to Streaming with Apache FlinkIntroduction to Streaming with Apache Flink
Introduction to Streaming with Apache FlinkTugdual Grall
 
Application Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureApplication Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureVARUN SAXENA
 
Application Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureApplication Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureVARUN SAXENA
 
Achieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudAchieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudScott Miao
 
Episode 4: Operating Kubernetes at Scale with DC/OS
Episode 4: Operating Kubernetes at Scale with DC/OSEpisode 4: Operating Kubernetes at Scale with DC/OS
Episode 4: Operating Kubernetes at Scale with DC/OSMesosphere Inc.
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformEva Tse
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data PlatformAmazon Web Services
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsMapR Technologies
 
First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationInside Analysis
 
CS8091_BDA_Unit_IV_Stream_Computing
CS8091_BDA_Unit_IV_Stream_ComputingCS8091_BDA_Unit_IV_Stream_Computing
CS8091_BDA_Unit_IV_Stream_ComputingPalani Kumar
 
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...DataStax Academy
 
User-space Network Processing
User-space Network ProcessingUser-space Network Processing
User-space Network ProcessingRyousei Takano
 
Modernizing upstream workflows with aws storage - john mallory
Modernizing upstream workflows with aws storage -  john malloryModernizing upstream workflows with aws storage -  john mallory
Modernizing upstream workflows with aws storage - john malloryAmazon Web Services
 
Event Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaEvent Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaDataWorks Summit
 

Similaire à Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB (20)

Episode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data ServicesEpisode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data Services
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
Dataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice WayDataservices - Processing Big Data The Microservice Way
Dataservices - Processing Big Data The Microservice Way
 
Introduction to Streaming with Apache Flink
Introduction to Streaming with Apache FlinkIntroduction to Streaming with Apache Flink
Introduction to Streaming with Apache Flink
 
Amazon Kinesis
Amazon KinesisAmazon Kinesis
Amazon Kinesis
 
Application Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureApplication Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and Future
 
Application Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureApplication Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and Future
 
Achieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudAchieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloud
 
Episode 4: Operating Kubernetes at Scale with DC/OS
Episode 4: Operating Kubernetes at Scale with DC/OSEpisode 4: Operating Kubernetes at Scale with DC/OS
Episode 4: Operating Kubernetes at Scale with DC/OS
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter Integration
 
Enterprise Data Lakes
Enterprise Data LakesEnterprise Data Lakes
Enterprise Data Lakes
 
CS8091_BDA_Unit_IV_Stream_Computing
CS8091_BDA_Unit_IV_Stream_ComputingCS8091_BDA_Unit_IV_Stream_Computing
CS8091_BDA_Unit_IV_Stream_Computing
 
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
 
User-space Network Processing
User-space Network ProcessingUser-space Network Processing
User-space Network Processing
 
Modernizing upstream workflows with aws storage - john mallory
Modernizing upstream workflows with aws storage -  john malloryModernizing upstream workflows with aws storage -  john mallory
Modernizing upstream workflows with aws storage - john mallory
 
Event Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaEvent Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache Kafka
 

Plus de Carol McDonald

Spark machine learning predicting customer churn
Spark machine learning predicting customer churnSpark machine learning predicting customer churn
Spark machine learning predicting customer churnCarol McDonald
 
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Carol McDonald
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine LearningCarol McDonald
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBaseCarol McDonald
 
Machine Learning Recommendations with Spark
Machine Learning Recommendations with SparkMachine Learning Recommendations with Spark
Machine Learning Recommendations with SparkCarol McDonald
 
Getting started with HBase
Getting started with HBaseGetting started with HBase
Getting started with HBaseCarol McDonald
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill Carol McDonald
 

Plus de Carol McDonald (8)

Spark machine learning predicting customer churn
Spark machine learning predicting customer churnSpark machine learning predicting customer churn
Spark machine learning predicting customer churn
 
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine Learning
 
Apache Spark streaming and HBase
Apache Spark streaming and HBaseApache Spark streaming and HBase
Apache Spark streaming and HBase
 
Machine Learning Recommendations with Spark
Machine Learning Recommendations with SparkMachine Learning Recommendations with Spark
Machine Learning Recommendations with Spark
 
CU9411MW.DOC
CU9411MW.DOCCU9411MW.DOC
CU9411MW.DOC
 
Getting started with HBase
Getting started with HBaseGetting started with HBase
Getting started with HBase
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill
 

Dernier

How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...software pro Development
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension AidPhilip Schwarz
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesVictorSzoltysek
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionOnePlan Solutions
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfVishalKumarJha10
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...kalichargn70th171
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 

Dernier (20)

How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...How to Choose the Right Laravel Development Partner in New York City_compress...
How to Choose the Right Laravel Development Partner in New York City_compress...
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 

Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB

  • 1. © 2017 MapR Technologies Data Pipeline Using Apache APIs: Kafka, Spark, and MapR-DB
  • 2. © 2017 MapR Technologies Data Pipeline Using Apache APIs: Kafka, Spark, and MapR-DB •  Kafka •  Spark Streaming •  Spark SQL
  • 3. © 2017 MapR Technologies Streaming ETL Pipeline Data Collect Process Store Stream Topic Spark Streaming Kafka API SQL Open JSON API Analyze JSON SQL
  • 4. © 2017 MapR Technologies Traditional ETL Image Reference: Databricks
  • 5. © 2017 MapR Technologies Streaming ETL Image Reference: Databricks
  • 6. © 2017 MapR Technologies What is a Stream ? Producers Consumers •  A stream is an continuous sequence of events or records •  Records are key-value pairs Stream of Data key value key value key value key value
  • 7. © 2017 MapR Technologies Examples of Streaming Data Fraud detection Smart Machinery Smart Meters Home Automation Networks Manufacturing Security Systems Patient Monitoring
  • 8. © 2017 MapR Technologies Examples of Streaming Data •  Monitoring devices combined with ML can provide alerts for Sepsis, which is one of the leading causes for death in hospitals –  http://www.computerweekly.com/news/450422258/Putting-sepsis-algorithms-into-electronic- patient-records
  • 9. © 2017 MapR Technologies Examples of Streaming Data •  A Stanford team has shown that a machine-learning model can identify heart arrhythmias from an electrocardiogram (ECG) better than an expert –  https://www.technologyreview.com/s/608234/the-machines-are-getting-ready-to-play-doctor/
  • 10. © 2017 MapR Technologies Applying Machine Learning to Live Patient Data •  https://www.slideshare.net/caroljmcdonald/applying-machine-learning-to- live-patient-data
  • 11. © 2017 MapR Technologies What has changed in the past 10 years? •  Distributed computing •  Streaming analytics •  Improved machine learning
  • 12. © 2017 MapR Technologies Serve DataStore DataCollect Data What Do We Need to Do ? Process DataData Sources ? ? ? ?
  • 13. © 2017 MapR Technologies Collect the Data Data IngestSource Stream Topic •  Data Ingest: –  Using the Kafka API
  • 14. © 2017 MapR Technologies Organize Data into Topics with MapR-Event Streams Topics: Logical collection of events, Organize Events into Categories Consumers MapR Cluster Topic: Pressure Topic: Temperature Topic: Warnings Consumers Consumers Kafka API Kafka API
  • 15. © 2017 MapR Technologies Scalable Messaging with MapR Event Streams Server 1 Partition1: Topic - Pressure Partition1: Topic - Temperature Partition1: Topic - Warning Server 2 Partition2: Topic - Pressure Partition2: Topic - Temperature Partition2: Topic - Warning Server 3 Partition3: Topic - Pressure Partition3: Topic - Temperature Partition3: Topic - Warning Topics are partitioned for throughput and scalability
  • 16. © 2017 MapR Technologies Scalable Messaging with MapR Event Streams Partition1: Topic - Pressure Partition1: Topic - Temperature Partition1: Topic - Warning Partition2: Topic - Pressure Partition2: Topic - Temperature Partition2: Topic - Warning Partition3: Topic - Pressure Partition3: Topic - Temperature Partition3: Topic - Warning Producers are load balanced between partitions Kafka API
  • 17. © 2017 MapR Technologies Scalable Messaging with MapR Event Streams Partition1: Topic - Pressure Partition1: Topic - Temperature Partition1: Topic - Warning Partition2: Topic - Pressure Partition2: Topic - Temperature Partition2: Topic - Warning Partition3: Topic - Pressure Partition3: Topic - Temperature Partition3: Topic - Warning Consumers Consumers Consumers Consumer groups can read in parallel Kafka API
  • 18. © 2017 MapR Technologies Partition is like an Event Log Consumers MapR Cluster Topic: Admission / Server 1 Topic: Admission / Server 2 Topic: Admission / Server 3 Consumers Consumers Partition 1 New Messages are appended to the end Partition 2 Partition 3 6 5 4 3 2 1 3 2 1 5 4 3 2 1 Producers Producers Producers New Message 6 5 4 3 2 1 Old Message
  • 19. © 2017 MapR Technologies Partition is like a Queue Messages are delivered in the order they are received MapR Cluster 6 5 4 3 2 1 Consumer groupProducers Read cursors Consumer group
  • 20. © 2017 MapR Technologies Unlike a queue, events are still persisted after they’re delivered Messages remain on the partition, available to other consumers MapR Cluster (1 Server) Topic: Warning Partition 1 3 2 1 Unread Events Get Unread 3 2 1 Client Library ConsumerPoll
  • 21. © 2017 MapR Technologies When Are Messages Deleted? •  Messages can be persisted forever •  Or •  Older messages can be deleted automatically based on time to live MapR Cluster (1 Server) 6 5 4 3 2 1Partition 1 Older message
  • 22. © 2017 MapR Technologies Traditional Message queue
  • 23. © 2017 MapR Technologies How do we do this with High Performance at Scale? •  Parallel operations •  minimizes disk read/writes
  • 24. © 2017 MapR Technologies Processing Same Message for Different Purposes Consumers Consumers Consumers Producers Producers Producers MapR-FS Kafka API Kafka API
  • 25. © 2017 MapR Technologies Stream as the System of Record
  • 26. © 2017 MapR Technologies A Table is a Snapshot of a Stream Updates Imagine each event as a change to an entry in a database. Account Id Balance WillO 80.00 BradA 20.00 1: WillO : Deposit : 100.00 2: BradA : Deposit : 50.00 3: BradA : Withdraw : 30.00 4: WillO : Withdraw: 20.00 https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying Change log 4 3 2 1
  • 27. © 2017 MapR Technologies A Stream is a Change Log of a Table Change Log https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying 3 2 1 3 2 1 3 2 1 Duality of Streams and Tables Master: Append writes Slave: Apply writes in order Replication of changes
  • 28. © 2017 MapR Technologies Rewind: Reprocessing Events MapR Cluster 6 5 4 3 2 1Producers Reprocess from oldest Consumer Create new view, Index, cache
  • 29. © 2017 MapR Technologies Rewind Reprocessing Events MapR Cluster 6 5 4 3 2 1Producers To Newest Consumer new view Read from new view
  • 30. © 2017 MapR Technologies Event Sourcing, Command Query Responsibility Separation: Turning the Database Upside Down Key-Val Document Graph Wide Column Time Series Relational ???Events Updates
  • 31. © 2017 MapR Technologies Use Case: Streaming System of Record for Healthcare Objective: •  Build a flexible, secure healthcare exchange Records Analysis Applications Challenges: •  Many different data models •  Security and privacy issues •  HIPAA compliance Records
  • 32. © 2017 MapR Technologies32 ALLOY Health: Exchange State HIE Clinical Data Viewer Reporting and Analytics Clinical Data Financial Data Provider Organizations What are the outcomes in the entire state on diabetes? Are there doctors that are doing this better than others? Georgia Health Connect
  • 33. © 2017 MapR Technologies Use Case: Streaming System of Record for Healthcare
  • 34. © 2017 MapR Technologies Spark Dataset
  • 35. © 2017 MapR Technologies Spark Distributed Datasets Dataset W Executor P4 W Executor P1 P3 W Executor P2 partitioned Partition 1 8213034705, 95, 2.927373, jake7870, 0…… Partition 2 8213034705, 115, 2.943484, Davidbresler2, 1…. Partition 3 8213034705, 100, 2.951285, gladimacowgirl, 58… Partition 4 8213034705, 117, 2.998947, daysrus, 95…. •  Read only collection of typed objects Dataset[T] •  Partitioned across a cluster •  Operated on in parallel •  can be Cached
  • 36. © 2017 MapR Technologies val df: Dataset[Payment] = spark.read.json(”/p/file.json").as[Payment] Spark Distributed Datasets read from a file Worker Task Worker Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 Process & Cache Data Process & Cache Data Process & Cache Data Task Task Block 1 Driver tasks tasks tasks
  • 37. © 2017 MapR Technologies DataFrame is like a table Dataset[Row] row columns DataFrame = Dataset[Row] can use Spark SQL
  • 38. © 2017 MapR Technologies A Dataset is a collection of typed objects Dataset[objects] objects columns Dataset[Typed Object] can use Spark SQL and Functions
  • 39. © 2017 MapR Technologies Spark Streaming
  • 40. © 2017 MapR Technologies Collect Data Process the Data with Spark Streaming Process Data Stream Topic •  scalable, high-throughput, stream processing of live data
  • 41. © 2017 MapR Technologies What is a Stream ? Producers Consumers •  A stream is an continuous sequence of events or records •  Records are key-value pairs Stream of Data key value key value key value key value
  • 42. © 2017 MapR Technologies Data stream Unbounded Table new data in the data stream = new rows appended to an unbounded table Data stream as an unbounded table Treat Stream as Unbounded Tables
  • 43. © 2017 MapR Technologies Spark Distributed Datasets read from Stream partitions Task Cache Process & Cache Data offsets Stream partition Task Cache Process & Cache Data Task Cache Process & Cache Data Driver Stream partition Stream partition Data is cached for aggregations And windowed functions
  • 44. © 2017 MapR Technologies Streaming data = Unbounded table Static Data = bounded table Same Dataset operations & SQL Stream Processing on Spark SQL Engine
  • 45. © 2017 MapR Technologies Conceptual model incremental query 3.  Append
  • 46. © 2017 MapR Technologies Continuous incremental execution Spark SQL converts queries to incremental execution plans For input of data Incremental Incremental Incremental
  • 47. © 2017 MapR Technologies Use Case: Payment Data Payment input data Stream Input "NEW","Covered Recipient Physician",,,,"132655","GREGG","D","ALZATE",,"8745 AERO DRIVE","STE 200","SAN DIEGO","CA","92123","United States",,,"Medical Doctor","Allopathic & Osteopathic Physicians|Radiology|Diagnostic Radiology","CA",,,,,"DFINE, Inc","100000000326","DFINE, Inc","CA","United States", 90.87,"02/12/2016","1","In-kind items and services","Food and Beverage",,,,"No","No Third Party Payment",,,,,"No","346039438","No","Yes","Covered","Device","Radiology","StabiliT", ,"Covered","Device","Radiology","STAR Tumor Ablation System",,,,,,,,,,,,,,,,,"2016","06/30/2017" transform Spark Streaming { "_id":"317150_08/26/2016_346122858", "physician_id":"317150", "date_payment":"08/26/2016", "record_id":"346122858", "payer":"Mission Pharmacal Company", "amount":9.23, "Physician_Specialty":"Obstetrics & Gynecology", "Nature_of_payment":"Food and Beverage" } JSON
  • 48. © 2017 MapR Technologies Use Case: Open Payment Dataset •  Payments Drug and Device companies make to •  Physicians and Teaching Hospitals for •  Travel, Research, Gifts, Speaking fees, and Meals
  • 49. © 2017 MapR Technologies Scenario: Payment Data Provider ID Date Payer Payer State Provider Specialty Provider State Amount Payment Nature 1261770 01/11/2016 Southern Anesthesia & Surgical, Inc CO Oral and Maxillofacial Surgery CA 117.5 Food and Beverage
  • 50. © 2017 MapR Technologies Stream the data into a Dataframe: Define the Schema case class Payment(physician_id: String, date_payment: String, payer: String, payer_state: String amount: Double, physician_specialty: String, phys_state: String, nature_of_payment:String) val schema = StructType(Array( StructField("_id", StringType, true), StructField("physician_id", StringType, true), StructField("date_payment", StringType, true), StructField("payer", StringType, true), StructField("payer_state", StringType, true), StructField("amount", DoubleType, true), StructField("physician_specialty", StringType, true), StructField("physician_type", StringType, true), StructField("physician_state", StringType, true), StructField("nature_of_payment", StringType, true) ))
  • 51. © 2017 MapR Technologies Function to Parse CSV into Payment Class def parse(str: String): Payment = { val td = str.split(",(?=([^"]*"[^"]*")*[^"]*$)") val physician_id = td(5) val payer = td(27) . . . val physician_state = td(20) var focus =td(19) val id =physician_state+'_’+focus+ '_’+ date_payment+'_' + record_id Payment(id, physician_id, date_payment, payer, payer_state, amount, physician_type, focus, physician_state, nature_of_payment) }
  • 52. © 2017 MapR Technologies Parsed and Transformed Payment Data { "_id":”TX_Gynecology_08/26/2016_346122858", "physician_id":"317150", "date_payment":"08/26/2016", "payer":"Mission Pharmacal Company", "payer_state":”CO", "amount":9.23, ”physician_specialty":”Gynecology", “physician_state":”TX" ”nature_of_payment":"Food and Beverage" } Example Dataset Row
  • 53. © 2017 MapR Technologies Streaming pipeline Data source Specify data source returns a dataframe val df1 = spark.readStream.format("kafka") .option("kafka.boostrap.servers",...) .option("subscribe", "topic") .load()
  • 54. © 2017 MapR Technologies Transformation Cast bytes from Kafka records to a string, parse csv , and return Dataset[Payment] spark.udf.register("deserialize", (message: String) => parse(message)) val df2=df1 .selectExpr("""deserialize(CAST(value as STRING)) AS message""") .select($"message".as[Payment])
  • 55. © 2017 MapR Technologies Streams Stream Processing Stream Processing Storage Raw Enriched Filtered Stream Processing: •  Filtering •  Transformations •  Aggregations •  Enrichments with ML •  Enrichments with joins MapR-DB MapR-XD
  • 56. © 2017 MapR Technologies Dataframe Integrated Queries L I Query Description agg(expr, exprs) Aggregates on entire DataFrame distinct Returns new DataFrame with unique rows except(other) Returns new DataFrame with rows from this DataFrame not in other DataFrame filter(expr); where(condition) Filter based on the SQL expression or condition groupBy(cols: Columns) Groups DataFrame using specified columns join (DataFrame, joinExpr) Joins with another DataFrame using given join expression sort(sortcol) Returns new DataFrame sorted by specified column select(col) Selects set of columns
  • 57. © 2017 MapR Technologies Continuous aggregations Continuously compute average payment amountval d3=df2.avg(“amount")
  • 58. © 2017 MapR Technologies Continuous aggregations and filter val d3=df2.groupBy(“payer") .avg(“amount")
  • 59. © 2017 MapR Technologies Continuous aggregations and filter val d3=df2 .filter($"amount" > 20000)
  • 60. © 2017 MapR Technologies Streaming pipeline Kafka topic Data Sink Write results to Kafka topic Start running the query val query = df3.write .format("kafka") .option("kafka.bootstrap.servers", "host1:port1,host2:port2") .option("topic", "/apps/uberstream:uberp") query.start.awaitTermination()
  • 61. © 2017 MapR Technologies Streaming Applicaton Streaming Dataset Streaming Dataset Transformed Stream Topic Stream Topic Stream Topic Stream Topic Stream Topic Stream Topic
  • 62. © 2017 MapR Technologies Spark & MapR-DB
  • 63. © 2017 MapR Technologies AnalyzeStore DataCollect Data What Do We Need to Do ? Process DataData Sources ? ? ? ? Stream Topic Spark Streaming JSON
  • 64. © 2017 MapR Technologies Stream Processing Pipeline Data Collect Process Store Stream Topic Spark Streaming Kafka API SQL Open JSON API Analyze JSON SQL
  • 65. © 2017 MapR Technologies Where/How to store data ?
  • 66. © 2017 MapR Technologies Relational Database vs. MapR-DB bottleneck Storage ModelRDBMS MapR-DB Normalized schema à Joins for queries can cause bottleneck De-Normalized schema à Data that is read together is stored together Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val
  • 67. © 2017 MapR Technologies MapR-DB JSON Document Store Data that is read together is stored together
  • 68. © 2017 MapR Technologies Designed for Partitioning and Scaling Key Range xxxx xxxx Key Range xxxx xxxx Key Range xxxx xxxx Fast Reads and Writes by Key! Data is automatically partitioned by Key Range! Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val
  • 69. © 2017 MapR Technologies MapR-DB JSON Document Store Data is automatically partitioned by Key Range!
  • 70. © 2017 MapR Technologies Payment Data Automatically sorted and partitioned by Key Range (_id) { "_id":”TX_Gynecology_08/26/2016_346122858", "physician_id":"317150", "date_payment":"08/26/2016", "payer":"Mission Pharmacal Company", "payer_state":”CO", "amount":9.23, ”physician_specialty":”Gynecology", “physician_state":”TX" ”nature_of_payment":"Food and Beverage" }
  • 71. © 2017 MapR Technologies Spark Streaming writing to MapR-DB JSON
  • 72. © 2017 MapR Technologies Spark MapR-DB Connector •  Connection object in every Spark Executor: •  distributed parallel writes & reads
  • 73. © 2017 MapR Technologies Use Case: Flight Delays Payment input data Stream Input "NEW","Covered Recipient Physician",,,,"132655","GREGG","D","ALZATE",,"8745 AERO DRIVE","STE 200","SAN DIEGO","CA","92123","United States",,,"Medical Doctor","Allopathic & Osteopathic Physicians|Radiology|Diagnostic Radiology","CA",,,,,"DFINE, Inc","100000000326","DFINE, Inc","CA","United States", 90.87,"02/12/2016","1","In-kind items and services","Food and Beverage",,,,"No","No Third Party Payment",,,,,"No","346039438","No","Yes","Covered","Device","Radiology","StabiliT", ,"Covered","Device","Radiology","STAR Tumor Ablation System",,,,,,,,,,,,,,,,,"2016","06/30/2017" transform Spark Streaming { "_id":"317150_08/26/2016_346122858", "physician_id":"317150", "date_payment":"08/26/2016", "record_id":"346122858", "payer":"Mission Pharmacal Company", "amount":9.23, "Physician_Specialty":"Obstetrics & Gynecology", "Nature_of_payment":"Food and Beverage" } JSON
  • 74. © 2017 MapR Technologies Streaming pipeline Data Sink Write result to maprdb Start running the query val query = df2.writeStream .format(MapRDBSourceConfig.Format) .option(MapRDBSourceConfig.TablePathOption, "/apps/paytable") .option(MapRDBSourceConfig.CreateTableOption, false) .option(MapRDBSourceConfig.IdFieldPathOption, "value") .outputMode(”append") query.start().awaitTermination()
  • 75. © 2017 MapR Technologies Streaming Applicaton Streaming Dataset Streaming Dataset Transformed Tablets Stream Topic Stream Topic Stream Topic Data is rapidly available for complex, ad-hoc analytics
  • 76. © 2017 MapR Technologies Explore the Data With Spark SQL
  • 77. © 2017 MapR Technologies AnalyzeStore DataCollect Data What Do We Need to Do ? Process DataData Sources ? ? ? Stream Topic Spark Streaming JSON SQL
  • 78. © 2017 MapR Technologies Spark SQL Querying MapR-DB JSON
  • 79. © 2017 MapR Technologies Data Frame Load data Load the data into a Dataframe val pdf: Dataset[Payment] = spark.loadFromMapRDB[Payment]("/apps/paytable", schema) .as[Payment] pdf.select("_id", "payer", "amount”).show
  • 80. © 2017 MapR Technologies val pdf: Dataset[Payment] = spark.loadFromMapRDB[Payment]("/apps/paytable", schema) .as[Payment] Spark Distributed Datasets read from MapR-DB Partitions Worker Task Worker Driver Cache 1 Cache 2 Cache 3 Process & Cache Data Process & Cache Data Process & Cache Data Task Task Driver tasks tasks tasks
  • 81. © 2017 MapR Technologies Language Integrated Queries L I Query Description agg(expr, exprs) Aggregates on entire DataFrame distinct Returns new DataFrame with unique rows except(other) Returns new DataFrame with rows from this DataFrame not in other DataFrame filter(expr); where(condition) Filter based on the SQL expression or condition groupBy(cols: Columns) Groups DataFrame using specified columns join (DataFrame, joinExpr) Joins with another DataFrame using given join expression sort(sortcol) Returns new DataFrame sorted by specified column select(col) Selects set of columns
  • 82. © 2017 MapR Technologies Top 5 Nature of Payment by count val res = pdf.groupBy(”Nature_of_Payment") .count() .orderBy(desc(count)) .show(5)
  • 83. © 2017 MapR Technologies Top 5 Nature of Payment by amount of payment %sql select Nature_of_payment, sum(amount) as total from payments group by Nature_of_payment order by total desc limit 5
  • 84. © 2017 MapR Technologies What are the Nature of Payments with payments > $1000 with count pdf.filter($"amount" > 1000) .groupBy("Nature_of_payment") .count().orderBy(desc("count")).show()
  • 85. © 2017 MapR Technologies Top 5 Physician Specialties by total Amount %sql select physician_specialty, sum(amount) as total from payments where physician_specialty IS NOT NULL group by physician_specialty order by total desc limit 5
  • 86. © 2017 MapR Technologies Average Payment by Specialty
  • 87. © 2017 MapR Technologies Top Payers by Total Amount with count %sql select payer, payer_state, count(*) as cnt, sum(amount) as total from payments group by payer, payer_state order by total desc limit 10
  • 88. © 2017 MapR Technologies Stream Processing Building a Complete Data Architecture MapR File System (MapR-XD) MapR Converged Data Platform MapR Database (MapR-DB) MapR Event Streams Sources/Apps Bulk Processing All of these components can run on the same cluster with the MapR Converged platform.
  • 89. © 2017 MapR Technologies Data Pipelines and Machine Learning Logistics Input Data + Actual Delay Input Data + Predictions Consumer withML Model 2 Consumer withML Model 1 Decoy results Consumer Consumer withML Model 3 Consumer Stream Archive Stream Scores Stream Input SQL SQL Real time Flight Data Stream Input Actual Delay Input Data + Predictions + Actual Delay Real Time dashboard + Historical Analysis
  • 90. © 2017 MapR Technologies
  • 91. © 2017 MapR Technologies To Learn More: •  MapR Free ODT http://learn.mapr.com/
  • 92. © 2017 MapR Technologies …helping you put data technology to work ●  Find answers ●  Ask technical questions ●  Join on-demand training course discussions ●  Follow release announcements ●  Share and vote on product ideas ●  Find Meetup and event listings Connect with fellow Apache Hadoop and Spark professionals community.mapr.com
  • 93. © 2017 MapR Technologies MapR Blog • https://www.mapr.com/blog/
  • 94. © 2017 MapR Technologies To Learn More: ETL Payment data pipeline •  https://mapr.com/blog/etl-pipeline-healthcare-dataset-with-spark-json-mapr- db/ •  https://mapr.com/blog/streaming-data-pipeline-transform-store-explore- healthcare-dataset-mapr-db/
  • 95. © 2017 MapR Technologies To Learn More: •  https://mapr.com/blog/how-stream-first-architecture-patterns-are- revolutionizing-healthcare-platforms/
  • 96. © 2017 MapR Technologies To Learn More: •  https://mapr.com/blog/ml-iot-connected-medical-devices/
  • 97. © 2017 MapR Technologies Applying Machine Learning to Live Patient Data •  https://www.slideshare.net/caroljmcdonald/applying-machine-learning-to- live-patient-data
  • 98. © 2017 MapR Technologies MapR Container for Developers • https://maprdocs.mapr.com/home/MapRContainerDevelopers/ MapRContainerDevelopersOverview.html
  • 99. © 2017 MapR Technologies MapR Data Science Refinery • https://mapr.com/products/data-science-refinery/
  • 100. © 2017 MapR Technologies MapR Data Platform
  • 101. © 2017 MapR Technologies Q&A ENGAGE WITH US