SlideShare une entreprise Scribd logo
1  sur  24
Télécharger pour lire hors ligne
WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
Peter James, Healthdirect Australia
How Australia’s National Health Services
Directory Improved Data Quality, Reliability, and
Integrity with Databricks Delta and Structured
Streaming
#UnifiedAnalytics #SparkAISummit
About Us
Healthdirect Australia designs and delivers
innovative services for governments to provide
every Australian with 24/7 access to the trusted
information and advice they need to manage
their own health and health-related issues.
3#UnifiedAnalytics #SparkAISummit
Our Impact
4#UnifiedAnalytics #SparkAISummit
National Health Services Directory
The National Health Services Directory (NHSD) holds core
information about healthcare organisations, services and
practitioners.
• Hospital, EMR, Registry Data sharing Arrangements
• Clinical Pathway Applications
• Telehealth and Contact Centre Services
• Support eHealth Secure Messaging Delivery
• API Integration using Fast Healthcare Interoperability Resources (FHIR)
• Consumer API’s and Web Widgets
• Sector Analytics and Reporting
5#UnifiedAnalytics #SparkAISummit
Supporting Health Planning
NHSD is used widely for analysis & planning
6#UnifiedAnalytics #SparkAISummit
healthmap.com.au Royal Flying Doctors Service’s, SPOT
platform
Landscape Changes
A Strategic Assessment defined a long-range vision of
success
• Australian Health Sector promotes adoption of HL7/FHIR, Provider
Directory Federation and Open Data standards
• Business drivers pushing for Cost reduction and operational efficiencies
• Architecture needed to scale to meet the objectives defined for future
success
• Need to develop new Analytics capabilities in support of Sector Health
Planning
• Improve Information Security to support with broader usage scenarios and
regulatory changes
7#UnifiedAnalytics #SparkAISummit
Challenges
Data Quality and Governance
• Availability and access to Systems Of Record
• Often no Single Source Of Truth
• High variety of data sources from Federal, State, Public/Private Hospitals,
EMR, and other Commercial Vendors
• Health Domains are complex - Ontologies, Taxonomies, Thesauri, Code
sets (e.g. MeSH, SNOMED-CT, ANZSIC)
• Record Linking Issues, identity harmonisation, quasi-identifiers
• Disjoint Data Governance (Network of Data Governance participants)
8#UnifiedAnalytics #SparkAISummit
Challenges
Data Silos
• Data stored in multiple subsystems
– File Systems, ETL and workflow transient storage systems
– Multiple RDBMS, Multiple Schemas
– Search Engine Indexes
• Read Inconsistency
– Data out-of-synch between Search, Databases and Analytics storage
– Large batch updates of low quality data leading to high error rates and process inefficiencies
– Transactional boundaries per Data Silo no holistic end-to-end consistency
• Data Access
– High operational overheads processes
– Incompatible with Security boundaries
9#UnifiedAnalytics #SparkAISummit
Data Silos to Data Aggregation
Single point of access to the component pieces
10#UnifiedAnalytics #SparkAISummit
AHPRA
(Registries)
Medicare
Jurisdictions
Aged
Care
Secure
Message
Vendors
Desktop
Vendors
Hospital
Networks
Others
NHSD
Challenges
Data Scale
• Processes need to scale to ~1 Billion Data Points
• New demands include Bookings, Appointments, Referrals,
Pricing and eHealth Transaction Activity – est. 1TB p.a.
• Support Data federation and Interoperability requirements with
requests growing > 58% p.a.
• Existing systems were already under pressure with batch
overruns and significant administration overheads
11#UnifiedAnalytics #SparkAISummit
Architecture Overview
12#UnifiedAnalytics #SparkAISummit
Architecture Improvements
Databricks DELTA to create Logical Data Zones
• LANDING, RAW, STAGE, GOLD (i.e. Bronze, Silver, Gold)
• Store data ‘as-is’ either structured and/or unstructured data in DELTA Tables
and Physical Partitions
• Read Consistency, Data integrity - ACID transactions
• Used DELTA Cache for frequent queries with transparent, automated cache
control
• Databricks Operational Control Plane for Cluster Administration,
Management functions like, Access Control, Jobs and Schedules
• Data Control Plane runs within our AWS Accounts under our Security Policy
and regulatory compliance
13#UnifiedAnalytics #SparkAISummit
Architecture Improvements
SPARK Structured Streaming for Continuous Applications
• Create a Streaming versions of existing batch ETL jobs.
• Move to Event based Data flows via Streaming Input, Kinesis, S3, and
DELTA
• Running more frequently lead to smaller and more manageable data
volumes
• Automation of release processes through Continuous Delivery and use
of Databricks REST API
• Recoverability through Checkpoints and Reliability through Streaming
Sinks to DELTA tables
14#UnifiedAnalytics #SparkAISummit
Databricks Cluster
Continuous Applications –
Continued
• Keeps 'in-memory’ state of object sets
including comparable versions for object
changes
• Stateful transformations
(flatMapGroupsWithState)
• Aggregations for Data Quality
measurements
• Built User-defined libraries for Data
Cleansing, Validation and complex logic
15#UnifiedAnalytics #SparkAISummit
def appendAttributeStats(
self, sourceDF,
attribute_name,
metrics = ['mean', 'min',
'max', 'median',
'num_missing',
'num_unique']):
Data Plane & Pipeline
16#UnifiedAnalytics #SparkAISummit
Architecture Improvements
Unified Analytics and Operations
• Data cleaning & matching algorithms reliable, predictable and measurable
• Data Quality Analytics calculated at runtime, attached to object attributes,
populate analysts dashboards
• Track Data Lineage and Lifecycles providing Operational visibility of where our
data is at any point
• Complex processes easily broken down into simpler steps with ‘checks and
balances’ prior to execution
• Significant cost reduction through decommissioning complex Administration
UI in favour of notebook environment
17#UnifiedAnalytics #SparkAISummit
Improved Algorithm Development &
Usage
18#UnifiedAnalytics #SparkAISummit
Data Sharing Agreements Data Integration Algos &
Processes
Data Quality
Auditing
Algorithms
• sound design, goal-focused
• verified against trusted data
• version-controlled, documented
• archived with training & validation
data
Data
• permanent, versioned
• use provenance to enhance algos,
revise confidence estimate
Data Quality
Reporting
Algorithm Evaluation
19#UnifiedAnalytics #SparkAISummit
FPR TPR threshold
0.014433 0.800855 1.00
0.014433 0.801709 0.99
0.014433 0.831054 0.98
0.014433 0.855840 0.97
0.016495 0.895726 0.95
0.026804 0.950997 0.90
• interpretation
– the algorithm is appropriate to the domain
– the data is of good quality
– un-matchable names are rare exceptions
– a matching threshold of 0.95 gives
TPR = 95.1%
FPR = 2.7%
ROC curve for the set-based, fuzzy
name-matching algorithm
Improved Data Processing
• 95% fuzzy name matches vs. <80% and dependent
on manual verification
• ~30,000 Automated Updates / mth vs. ~30,000 bi-
annual & Manual
• 20 mins full load vs. > 24 hrs
• 20 million records @ ~1 million/min
20#UnifiedAnalytics #SparkAISummit
Improved Data Security
• Databricks Platform provided foundational Security Accreditations
• Able to cross-walk these security compliance frameworks and localize
to Australian Regulatory frameworks
• This provided significant cost/effort reduction of GRC
• Continuous Data Assurance - Security Guard-Rails monitor changes to
access privileges, e.g. role elevation/conditional access, data security
metadata data and auditing against data leakage and spills.
21#UnifiedAnalytics #SparkAISummit
Securing it ALL
22#UnifiedAnalytics #SparkAISummit
Other Benefits
• Don’t have to be big to see benefits, strategic value and a new business model
• Strategic momentum through realizing business vision with technology transformation
• Dataset Transparency and Explainable Models with well-documented lineage and
quality
• Broader participation - no one holds a single parts of knowledge anymore, we extract
more value in our data when everyone was able to access it
• Governance transitioned focus on operational improvements
• Inked Agreement in August 2018 completed migration to Databricks December–
Live March 2019
23#UnifiedAnalytics #SparkAISummit
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT

Contenu connexe

Plus de Databricks

Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionDatabricks
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityDatabricks
 
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueIntuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueDatabricks
 
Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentDatabricks
 
Improving Apache Spark for Dynamic Allocation and Spot Instances
Improving Apache Spark for Dynamic Allocation and Spot InstancesImproving Apache Spark for Dynamic Allocation and Spot Instances
Improving Apache Spark for Dynamic Allocation and Spot InstancesDatabricks
 
Importance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowImportance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowDatabricks
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta LakeDatabricks
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IODatabricks
 

Plus de Databricks (20)

Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
 
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + FugueIntuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue
 
Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload Deployment
 
Improving Apache Spark for Dynamic Allocation and Spot Instances
Improving Apache Spark for Dynamic Allocation and Spot InstancesImproving Apache Spark for Dynamic Allocation and Spot Instances
Improving Apache Spark for Dynamic Allocation and Spot Instances
 
Importance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLowImportance of ML Reproducibility & Applications with MLfLow
Importance of ML Reproducibility & Applications with MLfLow
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 

Dernier

LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...ttt fff
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 

Dernier (20)

LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
毕业文凭制作#回国入职#diploma#degree美国加州州立大学北岭分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#de...
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 

How Australia’s National Health Services Directory Improved Data Quality, Reliability, and Integrity with Databricks Delta and Structured Streaming

  • 1. WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
  • 2. Peter James, Healthdirect Australia How Australia’s National Health Services Directory Improved Data Quality, Reliability, and Integrity with Databricks Delta and Structured Streaming #UnifiedAnalytics #SparkAISummit
  • 3. About Us Healthdirect Australia designs and delivers innovative services for governments to provide every Australian with 24/7 access to the trusted information and advice they need to manage their own health and health-related issues. 3#UnifiedAnalytics #SparkAISummit
  • 5. National Health Services Directory The National Health Services Directory (NHSD) holds core information about healthcare organisations, services and practitioners. • Hospital, EMR, Registry Data sharing Arrangements • Clinical Pathway Applications • Telehealth and Contact Centre Services • Support eHealth Secure Messaging Delivery • API Integration using Fast Healthcare Interoperability Resources (FHIR) • Consumer API’s and Web Widgets • Sector Analytics and Reporting 5#UnifiedAnalytics #SparkAISummit
  • 6. Supporting Health Planning NHSD is used widely for analysis & planning 6#UnifiedAnalytics #SparkAISummit healthmap.com.au Royal Flying Doctors Service’s, SPOT platform
  • 7. Landscape Changes A Strategic Assessment defined a long-range vision of success • Australian Health Sector promotes adoption of HL7/FHIR, Provider Directory Federation and Open Data standards • Business drivers pushing for Cost reduction and operational efficiencies • Architecture needed to scale to meet the objectives defined for future success • Need to develop new Analytics capabilities in support of Sector Health Planning • Improve Information Security to support with broader usage scenarios and regulatory changes 7#UnifiedAnalytics #SparkAISummit
  • 8. Challenges Data Quality and Governance • Availability and access to Systems Of Record • Often no Single Source Of Truth • High variety of data sources from Federal, State, Public/Private Hospitals, EMR, and other Commercial Vendors • Health Domains are complex - Ontologies, Taxonomies, Thesauri, Code sets (e.g. MeSH, SNOMED-CT, ANZSIC) • Record Linking Issues, identity harmonisation, quasi-identifiers • Disjoint Data Governance (Network of Data Governance participants) 8#UnifiedAnalytics #SparkAISummit
  • 9. Challenges Data Silos • Data stored in multiple subsystems – File Systems, ETL and workflow transient storage systems – Multiple RDBMS, Multiple Schemas – Search Engine Indexes • Read Inconsistency – Data out-of-synch between Search, Databases and Analytics storage – Large batch updates of low quality data leading to high error rates and process inefficiencies – Transactional boundaries per Data Silo no holistic end-to-end consistency • Data Access – High operational overheads processes – Incompatible with Security boundaries 9#UnifiedAnalytics #SparkAISummit
  • 10. Data Silos to Data Aggregation Single point of access to the component pieces 10#UnifiedAnalytics #SparkAISummit AHPRA (Registries) Medicare Jurisdictions Aged Care Secure Message Vendors Desktop Vendors Hospital Networks Others NHSD
  • 11. Challenges Data Scale • Processes need to scale to ~1 Billion Data Points • New demands include Bookings, Appointments, Referrals, Pricing and eHealth Transaction Activity – est. 1TB p.a. • Support Data federation and Interoperability requirements with requests growing > 58% p.a. • Existing systems were already under pressure with batch overruns and significant administration overheads 11#UnifiedAnalytics #SparkAISummit
  • 13. Architecture Improvements Databricks DELTA to create Logical Data Zones • LANDING, RAW, STAGE, GOLD (i.e. Bronze, Silver, Gold) • Store data ‘as-is’ either structured and/or unstructured data in DELTA Tables and Physical Partitions • Read Consistency, Data integrity - ACID transactions • Used DELTA Cache for frequent queries with transparent, automated cache control • Databricks Operational Control Plane for Cluster Administration, Management functions like, Access Control, Jobs and Schedules • Data Control Plane runs within our AWS Accounts under our Security Policy and regulatory compliance 13#UnifiedAnalytics #SparkAISummit
  • 14. Architecture Improvements SPARK Structured Streaming for Continuous Applications • Create a Streaming versions of existing batch ETL jobs. • Move to Event based Data flows via Streaming Input, Kinesis, S3, and DELTA • Running more frequently lead to smaller and more manageable data volumes • Automation of release processes through Continuous Delivery and use of Databricks REST API • Recoverability through Checkpoints and Reliability through Streaming Sinks to DELTA tables 14#UnifiedAnalytics #SparkAISummit
  • 15. Databricks Cluster Continuous Applications – Continued • Keeps 'in-memory’ state of object sets including comparable versions for object changes • Stateful transformations (flatMapGroupsWithState) • Aggregations for Data Quality measurements • Built User-defined libraries for Data Cleansing, Validation and complex logic 15#UnifiedAnalytics #SparkAISummit def appendAttributeStats( self, sourceDF, attribute_name, metrics = ['mean', 'min', 'max', 'median', 'num_missing', 'num_unique']):
  • 16. Data Plane & Pipeline 16#UnifiedAnalytics #SparkAISummit
  • 17. Architecture Improvements Unified Analytics and Operations • Data cleaning & matching algorithms reliable, predictable and measurable • Data Quality Analytics calculated at runtime, attached to object attributes, populate analysts dashboards • Track Data Lineage and Lifecycles providing Operational visibility of where our data is at any point • Complex processes easily broken down into simpler steps with ‘checks and balances’ prior to execution • Significant cost reduction through decommissioning complex Administration UI in favour of notebook environment 17#UnifiedAnalytics #SparkAISummit
  • 18. Improved Algorithm Development & Usage 18#UnifiedAnalytics #SparkAISummit Data Sharing Agreements Data Integration Algos & Processes Data Quality Auditing Algorithms • sound design, goal-focused • verified against trusted data • version-controlled, documented • archived with training & validation data Data • permanent, versioned • use provenance to enhance algos, revise confidence estimate Data Quality Reporting
  • 19. Algorithm Evaluation 19#UnifiedAnalytics #SparkAISummit FPR TPR threshold 0.014433 0.800855 1.00 0.014433 0.801709 0.99 0.014433 0.831054 0.98 0.014433 0.855840 0.97 0.016495 0.895726 0.95 0.026804 0.950997 0.90 • interpretation – the algorithm is appropriate to the domain – the data is of good quality – un-matchable names are rare exceptions – a matching threshold of 0.95 gives TPR = 95.1% FPR = 2.7% ROC curve for the set-based, fuzzy name-matching algorithm
  • 20. Improved Data Processing • 95% fuzzy name matches vs. <80% and dependent on manual verification • ~30,000 Automated Updates / mth vs. ~30,000 bi- annual & Manual • 20 mins full load vs. > 24 hrs • 20 million records @ ~1 million/min 20#UnifiedAnalytics #SparkAISummit
  • 21. Improved Data Security • Databricks Platform provided foundational Security Accreditations • Able to cross-walk these security compliance frameworks and localize to Australian Regulatory frameworks • This provided significant cost/effort reduction of GRC • Continuous Data Assurance - Security Guard-Rails monitor changes to access privileges, e.g. role elevation/conditional access, data security metadata data and auditing against data leakage and spills. 21#UnifiedAnalytics #SparkAISummit
  • 23. Other Benefits • Don’t have to be big to see benefits, strategic value and a new business model • Strategic momentum through realizing business vision with technology transformation • Dataset Transparency and Explainable Models with well-documented lineage and quality • Broader participation - no one holds a single parts of knowledge anymore, we extract more value in our data when everyone was able to access it • Governance transitioned focus on operational improvements • Inked Agreement in August 2018 completed migration to Databricks December– Live March 2019 23#UnifiedAnalytics #SparkAISummit
  • 24. DON’T FORGET TO RATE AND REVIEW THE SESSIONS SEARCH SPARK + AI SUMMIT