Contenu connexe
Similaire à Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with MapR Database (20)
Plus de Carol McDonald (13)
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with MapR Database
- 2. 2 © 2018 MapR Technologies, Inc
• Overview of Kafka API
• Use Spark Structured Streaming to
continously:
• Read from Kafka topic
• Transfom
• Write to MapR document
database
• Analyze using Spark SQL
Agenda
2
- 3. 3 © 2018 MapR Technologies, Inc
Use Case: Open Payment Dataset
• Payments Drug and Device companies make to
• Physicians and Teaching Hospitals for
• Travel, Research, Gifts, Speaking fees, and Meals
• https://www.cms.gov/openpayments/
- 4. 4 © 2018 MapR Technologies, Inc
• Medicare payment data for most common inpatient diagnoses
• Data shows dramatically different charges and payments
• https://www.cms.gov/research-statistics-data-and-systems/statistics-trends-and-
reports/medicare-provider-charge-data/inpatient.html
InPatient Payment Dataset (IPPS)
Provider ID Provider
State
DRG Total
Discharges
Average
Charges
Average
Total
payments
Avg
Medicare
Payments
1261770 CA
Heart
Transplant
13 1,172,866 251,876 244,457
- 6. 6 © 2018 MapR Technologies, Inc
What is a Stream ?
• A stream is a continuous sequence of events
• Events are key-value pairs
- 7. 7 © 2018 MapR Technologies, Inc
What is Streaming Data?
Fraud detection Smart Machinery Smart Meters Home Automation
Networks Manufacturing Security Systems Patient Monitoring
- 8. 8 © 2018 MapR Technologies, Inc
Monitoring devices combined with ML can provide alerts for Sepsis,
which is one of the leading causes for death in hospitals
– http://www.computerweekly.com/news/450422258/Putting-sepsis-
algorithms-into-electronic-patient-records
Examples of Streaming Data
- 9. 9 © 2018 MapR Technologies, Inc
A Stanford team has shown that a machine-learning model can identify arrhythmias
from an EKG better than an expert
• https://www.technologyreview.com/s/608234/the-machines-are-getting-ready-
to-play-doctor/
Example of Streaming Data combined with Machine Learning
- 10. 10 © 2018 MapR Technologies, Inc
the ECG app on the Apple Watch can check heart rhythms and send a notification if an
irregular heart rhythm is identified.
• https://www.apple.com/newsroom/2018/12/ecg-app-and-irregular-heart-
rhythm-notification-available-today-on-apple-watch/
Example of Streaming Data combined with Machine Learning
- 11. 11 © 2018 MapR Technologies, Inc
https://mapr.com/blog/ml-iot-connected-medical-devices/
Applying Machine Learning to Live Patient Data
- 13. 13 © 2018 MapR Technologies, Inc
Intro to the Kafka API
- 14. 14 © 2018 MapR Technologies, Inc
Topics:
Logical collection of events
Organize Events into Categories
Organize Data into Topics with the MapR Event Store for Kafka
- 15. 15 © 2018 MapR Technologies, Inc
Topics:
Logical collection of events
Organize Events into Categories
Organize Data into Topics with the MapR Event Store for Kafka
Consumers
MapR Cluster
Topic: Pressure
Topic: Temperature
Topic: Warnings
Consumers
Consumers
Kafka API Kafka API
- 16. 16 © 2018 MapR Technologies, Inc
Topics are partitioned for throughput and
scalability
Scalable Messaging with MapR Event Streams
Server 1
Partition1: Topic - Pressure
Partition1: Topic - Temperature
Partition1: Topic - Warning
Server 2
Partition2: Topic - Pressure
Partition2: Topic - Temperature
Partition2: Topic - Warning
Server 3
Partition3: Topic - Pressure
Partition3: Topic - Temperature
Partition3: Topic - Warning
- 17. 17 © 2018 MapR Technologies, Inc
New Messages are
Added to the end
Partition is like an Event Log
New
Message
6 5 4 3 2 1
Old
Message
- 18. 18 © 2018 MapR Technologies, Inc
Messages are delivered in the order they are received
Partition is like a Queue
- 19. 19 © 2018 MapR Technologies, Inc
Events remain on the partition, available to other consumers
Unlike a queue, Events are not deleted when Read
- 20. 20 © 2018 MapR Technologies, Inc
Messages can be persisted forever
Or
Older messages can be deleted automatically based on time to live
Not deleting messages, minimizes disk read/writes
When Are Messages Deleted?
MapR Cluster
6 5 4 3 2 1Partition
1
Older
message
- 21. 21 © 2018 MapR Technologies, Inc
High Performance at Scale:
• Partitioning for Parallel operations
• Not deleting messages, minimizes disk read/writes
- 22. 22 © 2018 MapR Technologies, Inc
Processing Same Message for Different Purposes
- 23. 23 © 2018 MapR Technologies, Inc
Georgia Health Connect
ALLOY Health:
Exchange State HIE
Clinical Data Viewer
Reporting and Analytics
Clinical Data
Financial Data
Provider
Organizations
What are the outcomes in the entire
state on diabetes?
Are there doctors that are doing this
better than others?
Georgia Health Connect
- 24. 24 © 2018 MapR Technologies, Inc
Use Case: Streaming System of Record for Healthcare
- 26. 26 © 2018 MapR Technologies, Inc
Spark Distributed Datasets
partitioned
• Distributed collection of objects
Dataset[T]
• Partitioned across a cluster
• Operated on in parallel
• in memory can be Cached
- 27. 27 © 2018 MapR Technologies, Inc
DataFrame = Dataset[Row] can use Spark SQL
DataFrame is like a table
Dataset[Row]
row
columns
- 28. 28 © 2018 MapR Technologies, Inc
Dataset[Typed Object] can use Spark SQL and Functions
A Dataset is a distributed collection of objects
Dataset[objects]
object
columns
partitioned
- 29. 29 © 2018 MapR Technologies, Inc
Process the Data with Spark Structured Streaming
- 30. 30 © 2018 MapR Technologies, Inc
Datasets Read from Stream
Task
Cache
Process
& Cache
Data
offsets
Stream
partition
Task
Cache
Process
& Cache
Data
Task
Cache
Process
& Cache
Data
Driver
Stream
partition
Stream
partition
Data is cached for
aggregations
And windowed
functions
- 31. 31 © 2018 MapR Technologies, Inc
new data in the
data stream
=
new rows appended
to an unbounded table
Data stream as an unbounded table
Treat Stream as Unbounded Tables
- 32. 32 © 2018 MapR Technologies, Inc
The Stream is continuously processed
- 33. 33 © 2018 MapR Technologies, Inc
Spark automatically streamifies SQL plans
Image reference Databricks
- 34. 34 © 2018 MapR Technologies, Inc
Stream Processing
- 35. 35 © 2018 MapR Technologies, Inc
Use Case: Payment Data
"NEW","Covered Recipient Physician",,,,"132655","GREGG","D","ALZATE",,"8745
AERO DRIVE","STE 200","SAN DIEGO","CA","92123","United States",,,"Medical
Doctor","Allopathic & Osteopathic Physicians|Radiology|Diagnostic
Radiology","CA",,,,,"DFINE, Inc","100000000326","DFINE, Inc","CA","United States",
90.87,"02/12/2016","1","In-kind items and services","Food and Beverage",,,,"No","No
Third Party
Payment",,,,,"No","346039438","No","Yes","Covered","Device","Radiology","StabiliT",
,"Covered","Device","Radiology","STAR Tumor Ablation
System",,,,,,,,,,,,,,,,,"2016","06/30/2017"
{
"_id":"317150_08/26/2016_346122858",
"physician_id":"317150",
"date_payment":"08/26/2016",
"record_id":"346122858",
"payer":"Mission Pharmacal Company",
"amount":9.23,
"Physician_Specialty":"Obstetrics & Gynecology",
"Nature_of_payment":"Food and Beverage"
}
- 36. 36 © 2018 MapR Technologies, Inc
Scenario: Payment Data
Provider ID Date Payer Payer
State
Provider
Specialty
Provider
State
Amount Payment
Nature
1261770 01/11/2016
Southern
Anesthesia &
Surgical, Inc
CO
Oral and
Maxillofacial
Surgery
CA 117.5 Food and Beverage
- 37. 37 © 2018 MapR Technologies, Inc
val df1 = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "maprdemo:9092")
.option("subscribe", "/apps/stream:payments”)
.option("startingOffsets", "earliest")
.option("failOnDataLoss", false)
.option("maxOffsetsPerTrigger", 1000)
.load()
Streaming pipeline Kafka Data source
- 38. 38 © 2018 MapR Technologies, Inc
df1.printSchema()
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
Kafka DataFrame schema
- 39. 39 © 2018 MapR Technologies, Inc
case class Payment(_id: String, physician_id: String,
date_payment: String, payer: String, payer_state: String,
amount: Double, physician_type: String,physician_specialty: String,
physician_state: String, nature_of_payment: String)
def parsePayment(str: String): Payment = {
val td = str.split(",(?=([^"]*"[^"]*")*[^"]*$)")
val physician_id =td(5)
val amount = td(30).toDouble
. . .
Payment(id, physician_id, date_payment, payer, payer_state, amount,
physician_type, focus, physician_state, nature_of_payment)
}
Function to Parse CSV data to Payment Object
- 40. 40 © 2018 MapR Technologies, Inc
//register a user-defined function (UDF) to deserialize the message
spark.udf.register("deserialize",
(message: String) => parsePayment(message))
//use the UDF in a select expression
val df2 = df1.selectExpr("""deserialize(CAST(value as STRING)) AS
message""").select($"message".as[Payment])
Parse message txt to Payment Object
- 41. 41 © 2018 MapR Technologies, Inc
Writing to a Memory Sink
Write results to Memory
Start running the query
val query = cdf
.writeStream
.queryName("uber")
.format("memory")
.outputMode("append”)
query.start().awaitTermination()
- 42. 42 © 2018 MapR Technologies, Inc
%sql select * from payments limit 4:
Streaming Applicaton
+--------------------+------------+------------+--------------------+-----------+--------+-------------------+--------------------+---------------+------------------+
| _id|physician_id|date_payment| payer|payer_state| amount| physician_type| physician_specialty|physician_state| nature_of_payment|
+--------------------+------------+------------+--------------------+-----------+--------+-------------------+--------------------+---------------+------------------+
|CA_Diagnostic Rad...| 132655| 02/12/2016| DFINE, Inc| CA| 90.87| Medical Doctor|Diagnostic Radiology| CA| Food and Beverage|
|AZ_Vascular & Int...| 1006832| 01/26/2016| DFINE, Inc| CA| 25.0| Medical Doctor|Vascular & Interv...| AZ|Travel and Lodging|
|AZ_Vascular & Int...| 1006832| 02/13/2016| DFINE, Inc| CA| 32.0| Medical Doctor|Vascular & Interv...| AZ|Travel and Lodging|
|AZ_Vascular & Int...| 1006832| 02/19/2016| DFINE, Inc| CA| 27.27| Medical Doctor|Vascular & Interv...| AZ|Travel and Lodging|
- 43. 43 © 2018 MapR Technologies, Inc
%sql select * from payments limit 3:
Streaming Applicaton
- 44. 44 © 2018 MapR Technologies, Inc
%sql select physician_specialty, count(*) as cnt from payment2
group by physician_specialty order by cnt desc
Streaming Applicaton
- 46. 46 © 2018 MapR Technologies, Inc
MapR-DB Connector for Apache Spark
Spark Streaming writing to MapR-DB JSON
- 47. 47 © 2018 MapR Technologies, Inc
Spark MapR-DB Connector
- 48. 48 © 2018 MapR Technologies, Inc
Designed for Partitioning and Scaling
- 49. 49 © 2018 MapR Technologies, Inc
MapR-DB JSON Document Store
Data is automatically partitioned
and sorted by _id row key!
{
"_id":”AK_08/26/2016_346122858",
"physician_id":"317150",
"date_payment":"08/26/2016",
"payer":"Mission Pharmacal Company",
"payer_state":”CO",
"amount":9.23,
”physician_specialty":”Gynecology",
“physician_state":”AK"
”nature_of_payment":"Food and Beverage"
}
- 50. 50 © 2018 MapR Technologies, Inc
MapR Database shell
maprdb mapr:> find /user/mapr/paytable --limit 2
{"_id":"AK_01/03/2016_1309545_391702408", "amount":94.21,
"date_payment":"01/03/2016", "nature_of_payment":"Food and Beverage",
"payer":"Stryker Corporation", "payer_state":"MI","physician_id":"1309545",
"physician_specialty":"Specialist", "physician_state":"AK",
"physician_type":"Medical Doctor"}
{"_id":"AK_01/04/2016_100801_396460394","amount":11.59,
"date_payment":"01/04/2016", "nature_of_payment":"Food and Beverage",
"payer":"AbbVie, Inc.", "payer_state":"IL", "physician_id":"100801",
"physician_specialty":"Family Medicine", "physician_state":"AK",
"physician_type":"Medical Doctor"}
2 document(s) found.
Data is automatically partitioned
and sorted by _id row key!
- 51. 51 © 2018 MapR Technologies, Inc
Writing to a MapR-DB Sink
Write Streaming DataFrame
Query Results to MapR-DB
Start running the query
val query = cdf.writeStream
.format(MapRDBSourceConfig.Format)
.option(MapRDBSourceConfig.TablePathOption, tableName)
.option(MapRDBSourceConfig.IdFieldPathOption, "_id")
.option(MapRDBSourceConfig.CreateTableOption, false)
.option("checkpointLocation", "/user/mapr/check")
.option(MapRDBSourceConfig.BulkModeOption, true)
.option(MapRDBSourceConfig.SampleSizeOption, 1000)
query.start().awaitTermination()
- 52. 52 © 2018 MapR Technologies, Inc
Streaming Dataset Read from Topic partitions, Transformed,
Written to MapR Database partitions
- 53. 53 © 2018 MapR Technologies, Inc
Store in MapR Database
- 55. 55 © 2018 MapR Technologies, Inc
• Spark SQL queries and updates to MapR-DB
• With projection and filter pushdown, custom partitioning, and data locality
Spark SQL Querying MapR-DB JSON
- 56. 56 © 2018 MapR Technologies, Inc
val pdf: Dataset[Payment] = spark
.loadFromMapRDB[Payment](tableName, schema)
.as[Payment]
Spark Distributed Datasets read from MapR-DB Partitions
Worker
Task
Worker
Driver
Cache 1
Cache 2
Cache 3
Process
& Cache
Data
Process
& Cache
Data
Process
& Cache
Data
Task
Task
Driver
tasks
tasks
tasks
- 57. 57 © 2018 MapR Technologies, Inc
Data
Frame
Load data
pdf.createOrReplaceTempView(”payment")
pdf.select("_id", "payer", "amount”).show
Load the data into a Dataframe
Data is automatically partitioned
and sorted by _id row key!
- 58. 58 © 2018 MapR Technologies, Inc
• Which physician specialties, or physician states, are getting the most payments ?
• Count , total amount $, average amount $, standard deviation
• Which companies, or company states, are making the most payments ?
• Count , total amount $, average amount $, standard deviation
• What are the Top Nature of Payments by count, amount ?
• What are the payments over $2,000,000 ?
Now We Can Ask Questions Like:
Physician ID Date Payer Payer
State
Physician
Specialty
Physician
State
Amount Payment
Nature
1261770 01/11/2016
Southern
Anesthesia &
Surgical, Inc
CO
Oral and
Maxillofacial
Surgery
CA 117.5 Food and Beverage
- 59. 59 © 2018 MapR Technologies, Inc
Language Integrated Queries
L I Query Description
agg(expr, exprs) Aggregates on entire DataFrame
distinct Returns new DataFrame with unique rows
except(other) Returns new DataFrame with rows from this DataFrame not in other
DataFrame
filter(expr);
where(condition)
Filter based on the SQL expression or condition
groupBy(cols:
Columns)
Groups DataFrame using specified columns
join (DataFrame,
joinExpr)
Joins with another DataFrame using given join expression
sort(sortcol) Returns new DataFrame sorted by specified column
select(col) Selects set of columns
- 60. 60 © 2018 MapR Technologies, Inc
MapR Data Platform
- 61. 61 © 2018 MapR Technologies, Inc
Top 5 Nature of Payment by count
val res = pdf.groupBy(”Nature_of_Payment")
.count()
.orderBy(desc(count))
.show(5)
- 62. 62 © 2018 MapR Technologies, Inc
Top 5 Nature of Payment by amount of payment
%sql select Nature_of_payment,
sum(amount) as total from payments
group by Nature_of_payment order by total desc limit 5
- 63. 63 © 2018 MapR Technologies, Inc
Top 5 Physician Specialties by total Amount
%sql select physician_specialty, sum(amount) as total
from payments where physician_specialty IS NOT NULL
group by physician_specialty order by total desc limit 5
- 64. 64 © 2018 MapR Technologies, Inc
Average Payment by Specialty
- 65. 65 © 2018 MapR Technologies, Inc
What are the Nature of Payments with payments > $1000 ?
pdf.filter($"amount" > 1000)
.groupBy("Nature_of_payment")
.count().orderBy(desc("count")).show()
- 66. 66 © 2018 MapR Technologies, Inc
Top Payers by Total Amount with count
%sql select payer, payer_state, count(*) as cnt,
sum(amount) as total from payments
group by payer, payer_state order by total desc limit 10
- 67. 67 © 2018 MapR Technologies, Inc
Scenario: InPatient Payment Data
Provider ID Provider
State
DRG Total
Discharges
Average
Charges
Average
Total
payments
Avg
Medicare
Payments
1261770 CA
Heart
Transplant
13 1,172,866 251,876 244,457
- 68. 68 © 2018 MapR Technologies, Inc
MapR Database shell
maprdb mapr:> find /user/mapr/ippstable --limit 2
{"_id":"001_2014_010033", "avg_covered_charges":1172866.385, "avg_medicare_payments":
244457.9231, "avg_total_payments":251876.3077, "drg_code":"001", "drg_definition":"HEART
TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC", "provider_address":"619
SOUTH 19TH STREET", "provider_city":"BIRMINGHAM", "provider_id":"010033",
"provider_name":"UNIVERSITY OF ALABAMA HOSPITAL", "provider_region":"AL -
Birmingham", "provider_state":"AL", "provider_zip":"35233", "total_discharges":13,
"year":"2014"}
{"_id":"001_2014_030103", "avg_covered_charges":437531.3, "avg_medicare_payments":
133509.55, "avg_total_payments":240422.8, "drg_code":"001", "drg_definition":"HEART
TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC",
"provider_address":"5777 EAST MAYO BOULEVARD", "provider_city":"PHOENIX",
"provider_id":"030103", "provider_name":"MAYO CLINIC HOSPITAL", "provider_region":"AZ -
Phoenix", "provider_state":"AZ", "provider_zip":"85054", "total_discharges":20, "year":"2014”}
- 69. 69 © 2018 MapR Technologies, Inc
• What are the top charges, and payments by Hospital?
• by Hospital State?
• What are the top charges, and payments by diagnosis?
• Most common diagnosis ?
• What are the statistics for payments for most common diagnosis (Sepsis )
• What are the most expensive states for most common diagnosis sepsis
Now We Can Ask Questions Like:
Provider ID Provider
State
DRG Total
Discharges
Average
Charges
Average
Total
payments
Avg
Medicare
Payments
1261770 CA
Heart
Transplant
13 1,172,866 251,876 244,457
- 70. 70 © 2018 MapR Technologies, Inc
InPatient Prospective Payment System payments
- 71. 71 © 2018 MapR Technologies, Inc
To Download the code : Streaming data pipeline blogs
• https://mapr.com/blog/etl-pipeline-healthcare-dataset-with-spark-json-mapr-db/
• https://mapr.com/blog/streaming-data-pipeline-transform-store-explore-healthcare-
dataset-mapr-db/
- 72. 72 © 2018 MapR Technologies, Inc
http://mapr.com/ebooks/
New Spark Ebook
- 73. 73 © 2018 MapR Technologies, Inc
Read about Streaming Architectures
mapr.com/ebooks
- 74. 74 © 2018 MapR Technologies, Inc
Read about Machine Learning Logistics
mapr.com/ebooks
- 75. 75 © 2018 MapR Technologies, Inc
MapR Free ODT http://learn.mapr.com/
To Learn More: New Spark 2.0 training
- 76. 76 © 2018 MapR Technologies, Inc
https://mapr.com/blog/
MapR Blog
- 77. 77 © 2018 MapR Technologies, Inc
To Learn More:
• https://mapr.com/blog/ml-iot-connected-medical-devices/
- 78. 78 © 2018 MapR Technologies, Inc
To Learn More:
• https://mapr.com/blog/how-stream-first-architecture-patterns-are-revolutionizing-
healthcare-platforms/
- 79. 79 © 2018 MapR Technologies, Inc
Applying Machine Learning to Live Patient Data
• https://www.slideshare.net/caroljmcdonald/applying-machine-learning-to-live-
patient-data