December 2018
8 Guiding Principles
to Kickstart Your
Healthcare Big Data
Project
White Paper
Big Data technologies have seen widespread adoption
across industries over the past 3-5 years, but the
healthcare industry is only beginning to realize the
benefits. This shift is driven mainly by the exponential
growth of unstructured and semi-structured healthcare
information.
With sensors and wearables becoming a part of our
daily lives, people and organizations now have access
to enormous amounts of data, e.g., step tracking,
heartbeat / blood pressure monitoring, calorie
tracking, sleep pattern analysis, etc.
The explosion in healthcare data, while posing massive
storage and processing challenges, also has the
potential to transform the way we use data to improve
outcomes, for example:
▪ Predicting future care needs for specific
populations
▪ Minimizing health risks by predicting specific
events well in advance
▪ Identifying new patterns in disease detection, or
expediting their identification
Our experience with a large number of healthcare Big
Data projects has shown that most customers face
significant hurdles in kick-starting their Big Data
initiatives.
With limited or no experience, customers often realize
at the last minute that their Big Data implementations
lack the architectural robustness to address future
needs.
This white paper illustrates our experiences and
learnings across multiple Big Data implementation
projects. It contains a broad set of guidelines and best
practices around:
▪ Building highly secure Big Data lakes
▪ Efficiently processing vast amounts of data
▪ Providing access to downstream systems
▪ Best practices to mitigate project risks
▪ Technical hurdles and approaches to overcome
them
OVERVIEW OF BIG DATA IN HEALTHCARE
1. Use a Comprehensive Data
Ingestion Framework
While working with a Big Data lake, you need to
integrate numerous source systems with multiple feed
types. Your Big Data solution should have the ability to
handle different feed types, and cater to future source
system integration needs. Design a data ingestion
framework that addresses:
▪ All types of data: relational, semi-structured and
unstructured
▪ Standard feed protocols: HTTPS, SFTP, etc.
▪ Different loading scenarios, including initial load
and incremental load
▪ An ELT (Extract, Load, Transform) approach, as
compared to traditional ETL
▪ Various ingestion frequencies: batch, real-time
▪ Relevant data ingestion mechanisms (push or pull).
Pulling data may not be preferable when the data lake
is in the cloud and sources reside on-premises
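The dimensions listed above can be captured in a small feed descriptor. The following is a minimal sketch, not a prescribed design: the names `FeedConfig` and `validate_feed`, and the flag `source_on_premise`, are hypothetical and only illustrate how one feed definition could record data type, protocol, load type, frequency and mechanism, and how the push-versus-pull caveat might be checked.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class DataType(Enum):
    RELATIONAL = "relational"
    SEMI_STRUCTURED = "semi-structured"
    UNSTRUCTURED = "unstructured"


class LoadType(Enum):
    INITIAL = "initial"
    INCREMENTAL = "incremental"


class Frequency(Enum):
    BATCH = "batch"
    REAL_TIME = "real-time"


class Mechanism(Enum):
    PUSH = "push"
    PULL = "pull"


@dataclass(frozen=True)
class FeedConfig:
    """Hypothetical descriptor for one source feed."""
    name: str
    data_type: DataType
    protocol: str               # e.g. "HTTPS" or "SFTP"
    load_type: LoadType
    frequency: Frequency
    mechanism: Mechanism
    source_on_premise: bool = True


def validate_feed(feed: FeedConfig, lake_in_cloud: bool) -> List[str]:
    """Return warnings for a feed definition (illustrative rule only)."""
    warnings = []
    # Pulling from an on-premise source into a cloud-hosted lake may not be preferable.
    if lake_in_cloud and feed.source_on_premise and feed.mechanism is Mechanism.PULL:
        warnings.append(
            f"{feed.name}: pull from an on-premise source into a cloud lake; consider push"
        )
    return warnings


feed = FeedConfig(
    name="lab_results",
    data_type=DataType.SEMI_STRUCTURED,
    protocol="SFTP",
    load_type=LoadType.INCREMENTAL,
    frequency=Frequency.BATCH,
    mechanism=Mechanism.PULL,
)
print(validate_feed(feed, lake_in_cloud=True))
```

Keeping each feed as data rather than code is one way to let new source systems be added by configuration alone.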
Big Data Layers: Typical Architecture
[Figure. Labels shown: Data Sources; Data Ingestion Layer; Data Storage (S3, HDFS Cluster); Data Processing (Batch Processing, Real-time Processing); Data Query Layer (Querying & Analytics Engine: Statistical Analytics, Semantic Analytics, Predictive Modelling); Data Visualization Layer (Dashboards & Reports); Data Monitoring Layer; Data Security Layer]
GUIDING PRINCIPLES FOR BIG DATA IMPLEMENTATION
2. Choose the Right Storage Type for
Each Feed
Big Data ecosystems provide multiple storage
components, which makes it possible to choose the
most suitable storage type for each feed. Consider
the following points when choosing a storage type:
▪ Feed attributes: e.g., total size of data, size of
individual files, velocity at which data arrives, etc.
▪ Data ingestion system: Ability to identify whether
the data ingested is small or big in size
▪ Database architecture: Based on size, data can be
stored in distributed file systems, cloud storage or
in NoSQL / columnar databases. For example:
• Files of 128 MB and above (the default Hadoop
block size) can be stored in HDFS. Small files (in
KBs) can be stored in Hadoop sequence files or
in HBase
• JSON data can be stored in a document database
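The size-based rules above can be expressed as a simple routing function. This is a sketch under stated assumptions: the function name `choose_storage`, the returned labels and the decision to check the JSON format before the size are illustrative choices, not part of any Hadoop API.

```python
HDFS_BLOCK_SIZE = 128 * 1024 * 1024  # default Hadoop block size, in bytes


def choose_storage(file_size_bytes: int, file_format: str) -> str:
    """Pick a storage target for one incoming file (illustrative rules only)."""
    # Assumption: JSON documents go to a document database regardless of size.
    if file_format.lower() == "json":
        return "document-database"
    # Files at or above the block size fit HDFS well.
    if file_size_bytes >= HDFS_BLOCK_SIZE:
        return "hdfs"
    # Small files: pack them into sequence files or store them in HBase.
    return "hbase-or-sequence-file"


print(choose_storage(200 * 1024 * 1024, "csv"))
print(choose_storage(4 * 1024, "csv"))
print(choose_storage(4 * 1024, "json"))
```

A real ingestion system would also weigh total feed size and arrival velocity, which this sketch leaves out.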
3. Create Separate Storage Layers
Organizations starting their Big Data implementations
often ask, “How do we arrange data in a Data Lake?”
and “How many layers should we create?”. The
answers depend on the type of data being pulled and
processed in the Data Lake. In a standard scenario,
customers want to correlate data from relational
systems, IoT devices, social media and unstructured
data sources, e.g., notes, images, documents, etc. In
such scenarios, a three-layer approach can be used.
Raw Layer
Although not mandatory, it is advisable to store data
in its native form in the Data Lake. This forms the
raw layer, or raw zone, of the Data Lake. Data
scientists and analysts generally use the raw layer to
perform analysis without waiting for operational
data.
Curated Layer
While the raw layer is important from a raw analytics
and reprocessing perspective, it isn’t the optimal
way to store data, as it may contain duplicate,
incorrect or incomplete records. It is advisable
to create a curated data layer that holds cleansed and
standardized data. Analytics performed on the curated
layer provide much more accurate results than
analytics on the raw layer.
Operational Layer
Data stored in the curated layer isn’t reconciled and
continues to have the context of the source system.
This poses analytics challenges and raises the
possibility of duplicate records being sourced. The
operational layer solves this problem by reconciling
and transforming incoming data from different
sources into a single, canonical model.
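The flow from raw to curated to operational data can be sketched in plain Python. The field names (`source`, `patient_id`, `name`), the `id_map` cross-reference and both function names are hypothetical. In practice this logic would run on a distributed engine and the cross-reference would come from a master patient index.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def to_curated(raw: List[Record]) -> List[Record]:
    """Cleanse and standardize raw records while keeping the source context."""
    seen = set()
    curated = []
    for rec in raw:
        if not rec.get("patient_id") or not rec.get("name"):
            continue  # drop incomplete records
        clean = {
            "source": rec["source"],
            "patient_id": rec["patient_id"].strip(),
            "name": rec["name"].strip().title(),
        }
        key = (clean["source"], clean["patient_id"])
        if key in seen:
            continue  # drop duplicates within one source
        seen.add(key)
        curated.append(clean)
    return curated


def to_operational(
    curated: List[Record], id_map: Dict[Tuple[str, str], str]
) -> Dict[str, Record]:
    """Reconcile curated records into one canonical record per entity."""
    canonical: Dict[str, Record] = {}
    for rec in curated:
        master_id = id_map[(rec["source"], rec["patient_id"])]
        # First record seen for an entity wins; real rules would be richer.
        canonical.setdefault(master_id, {"master_id": master_id, "name": rec["name"]})
    return canonical


raw = [
    {"source": "ehr", "patient_id": " 17 ", "name": "jane doe"},
    {"source": "ehr", "patient_id": "17", "name": "Jane Doe"},
    {"source": "claims", "patient_id": "C-9", "name": "JANE DOE"},
    {"source": "claims", "patient_id": "", "name": "unknown"},
]
id_map = {("ehr", "17"): "P1", ("claims", "C-9"): "P1"}
print(to_operational(to_curated(raw), id_map))
```

The raw list is never modified, which mirrors the idea of keeping the raw layer available for reprocessing.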
4. Use the Right Data Processing
Frameworks & Tools
Identifying the right data processing framework can be
difficult as there are multiple processing frameworks in
the Big Data ecosystem. Common data processing
tasks like data cleansing, quality reporting,
aggregation, transformation and reconciliation can be
performed by standard ETL tools. For Big Data
processing, most standard ETL tools rely on Apache
Spark. While these tools provide a drag-and-drop UI and
out-of-the-box adapters, their internal workings are
abstracted, making them difficult to operate in certain
scenarios.
Commonly used ETL tools are Talend Enterprise,
Pentaho, Informatica, DataStage and Attunity. For
simple data processing needs, IT teams can create a
custom ETL utility using Apache Spark and its in-built
transformation functions.
The following best practices need to be kept in mind
while working on data processing frameworks:
▪ Big Data processing happens in a distributed
manner. Arrange data to minimize shuffling and
optimize performance. Use compression to speed up
data transfer over the network and reduce shuffling
time
▪ Joins are expensive in Big Data and should be
implemented thoughtfully. You can also improve
performance by de-normalizing records
▪ Parameterize processing runs, e.g., by batch ID, date
range or a specific set of records, to work around
bad / corrupt data issues
▪ Keep track of events (metadata, audits) during
data processing, e.g., who triggered the process,
which dataset was used for processing, size of the
dataset, count of records processed, status of
processing, start and finish time, etc.
▪ Be practical with partitioning. Distributed
processing often fails to take full advantage of the
nodes when partitions are too small or too numerous
▪ For stream processing, create enough partitions on
a Kafka topic to enable parallel processing in
Apache Spark. Checkpoint at regular intervals to
limit the impact of stream processing failures
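The audit-tracking practice from the list above can be illustrated with a small context manager. This is a sketch only: `AUDIT_LOG` stands in for a real audit table, and the name `audited_run` and the event fields are assumptions chosen to match the items listed (who triggered the run, dataset, record count, status, start and finish time).

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List

AUDIT_LOG: List[Dict] = []  # stand-in for a real audit table


@contextmanager
def audited_run(dataset: str, batch_id: str, triggered_by: str) -> Iterator[Dict]:
    """Record who ran what, on which data, and how the run ended."""
    event = {
        "dataset": dataset,
        "batch_id": batch_id,
        "triggered_by": triggered_by,
        "start_time": time.time(),
        "records_processed": 0,
        "status": "RUNNING",
    }
    try:
        yield event
        event["status"] = "SUCCEEDED"
    except Exception:
        event["status"] = "FAILED"
        raise
    finally:
        # The event is stored whether the run succeeded or failed.
        event["finish_time"] = time.time()
        AUDIT_LOG.append(event)


# The batch_id parameter also makes it possible to rerun a single bad batch.
with audited_run("claims", batch_id="2018-12-01", triggered_by="etl_service") as event:
    rows = [r for r in range(1000) if r % 2 == 0]
    event["records_processed"] = len(rows)

print(AUDIT_LOG[-1]["status"], AUDIT_LOG[-1]["records_processed"])
```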
5. Think of Data Management Right at
the Beginning
With business environments changing rapidly,
organizations need to consider data management as a
critical component of their business strategy. The
organization’s data strategy is affected by multiple
scenarios, including:
▪ Changes in organization or technology
▪ Process and people changes due to mergers and
acquisitions
▪ Changes in regulatory compliance or contractual
arrangements
▪ Issues with quality / availability / timeliness of data
that affect decision making
▪ Massive investments in time and resources
required to get data in correct shape
To overcome these challenges, organizations must
start thinking of data management solutions right
from project inception.
Only a few frameworks provide data management
capabilities for Big Data: Apache Atlas with Apache
Falcon on Hortonworks, Cloudera Navigator (which offers
partial functionality) and a custom framework on MapR.
7 Pillars of Data Management
1. Data Architecture: Data analysis, enterprise data
architecture, integration with applications
2. Content Management: Organizing, consolidating
and optimizing content
3. Data Development: Requirement analysis, data
modelling, database design, implementation and
maintenance
4. Master Data and Metadata Management: Master
patient index, master provider index, master facility
index, ICD 9/10, CPT, SNOMED, LOINC, DRG and
standards, common codes, integration metadata,
control metadata, quality metadata
5. Data Quality: Measurement, assessment and
improvement in data quality
6. Operations Management: Acquisition, recovery,
tuning, retention and purging
7. Data Security: Classification, administration,
privacy and confidentiality, authentication and
auditing
6. Provide a Sophisticated Search
Capability
Search is essential in Big Data systems because of the
high volumes of data involved: finding a specific
attribute value is like finding a needle in a
haystack. As entities are added to, updated in or
removed from the Data Lake, users need a quick way to
see which entities are present and to search for
specific attribute values.
It is always beneficial to index your data and provide a
search UI for quick discovery. Consider providing a
facility to tag attributes to make them searchable and
to allow users to group attributes using tags.
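Indexing and tagging can be illustrated with a tiny in-memory inverted index. This is a sketch under stated assumptions: the class `AttributeIndex` and its methods are hypothetical, and a production Data Lake would normally rely on a dedicated search engine rather than Python dictionaries.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


class AttributeIndex:
    """Tiny in-memory inverted index over (entity, attribute, value) triples."""

    def __init__(self) -> None:
        self._by_value: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
        self._by_tag: Dict[str, Set[str]] = defaultdict(set)

    def add(self, entity: str, attribute: str, value: str) -> None:
        """Index one attribute value, case-insensitively."""
        self._by_value[value.lower()].add((entity, attribute))

    def tag(self, attribute: str, tag: str) -> None:
        """Attach a tag to an attribute so related attributes can be grouped."""
        self._by_tag[tag.lower()].add(attribute)

    def search(self, value: str) -> Set[Tuple[str, str]]:
        """Return every (entity, attribute) pair holding the given value."""
        return set(self._by_value.get(value.lower(), set()))

    def attributes_for_tag(self, tag: str) -> Set[str]:
        """Return all attributes grouped under a tag."""
        return set(self._by_tag.get(tag.lower(), set()))


index = AttributeIndex()
index.add("patient", "city", "Boston")
index.add("provider", "city", "boston")
index.tag("city", "demographics")
print(sorted(index.search("BOSTON")))
print(index.attributes_for_tag("Demographics"))
```

A lookup is a dictionary access, so its cost does not grow with the number of indexed records, which is the point of indexing ahead of time.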
7. Simplify Data Access Using APIs and
Data Virtualization
All data warehousing / Data Lake projects need to
provide data extracts to downstream / external
systems, allow users to search data, and enable
analytics systems to connect to and analyze data using
standard interfaces. Most of these requirements can
be fulfilled by a thin API access layer that provides
unified access to the underlying data. The API layer
implementation should support standards-based
interfaces like REST, SQL or a combination of both.
Data extraction processes are scheduled jobs that
extract data from specific tables and store it in a
shared location (e.g., SFTP). A low-priority processing
queue can be used for data extraction during peak
hours to ensure the extraction query does not consume
all processing resources. Additionally, data
virtualization software (e.g., Denodo) or a custom data
virtualization layer (using Apache Ignite and Spark) can
be used to create a common interface for Data Lakes
and other source systems.
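The idea of a low-priority queue for extraction jobs can be shown with a small scheduler. This sketch only illustrates the ordering: the class `JobQueue`, the job names and the priority constants are hypothetical. On a real cluster the same effect would usually be configured through the resource manager's queues rather than written in application code.

```python
import heapq
import itertools
from typing import List, Tuple

HIGH, LOW = 0, 10  # smaller number = served first


class JobQueue:
    """Serve interactive work before scheduled extraction jobs."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._counter = itertools.count()  # keeps FIFO order within a priority

    def submit(self, job: str, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), job))

    def next_job(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = JobQueue()
queue.submit("extract_claims_to_sftp", LOW)
queue.submit("dashboard_query", HIGH)
queue.submit("extract_labs_to_sftp", LOW)
while len(queue):
    print(queue.next_job())
```

Extraction jobs still run in submission order, but only after higher-priority work has been served.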
8. Provide an Analytics Workspace for
Advanced Users
With the evolution of Big Data and Data Lakes, more
organizations are adopting advanced analytics tools
and technologies – e.g., Predictive Analytics, Machine
Learning, Deep Learning, Natural Language Processing
and AI algorithms. These technologies require
extensive piloting, model operationalization and
custom dashboarding before they can be applied in
real-world scenarios.
Data scientists and analysts need a dedicated
workspace and suitable toolsets to pull, process and
analyze raw, curated and aggregated data, and to share
their findings. They should be able to perform
activities like preliminary analysis, identifying new
trends and quick dashboarding, without affecting the
Data Lake.
An analytics workspace can be implemented in one of
the following ways:
A. Use Existing Data Lake Infrastructure to Carve
Out Space for Individual Data Scientists
This option uses the existing Data Lake infrastructure
to create slots for individual data scientists, where they
can work with a copy of the data using various tools,
e.g., Apache Spark based notebooks.
B. Use a Separate Cluster for Each Data Scientist
This option creates separate infrastructure for
individual users and pulls data from the Data Lake.
This option may prove costlier but provides a true
multitenant architecture and ensures that the system
performance is always optimal.
H-SCALE ADDRESSES KEY HEALTHCARE BIG DATA NEEDS
CitiusTech’s H-Scale platform for healthcare data management has been specifically designed to address healthcare
Big Data challenges such as data acquisition, real-time processing, Master Data Management, data security and
advanced analytics. Here is how H-Scale supports the Big Data requirements discussed in this paper.
Data Ingestion
Highly configurable data ingestion pipeline that caters to structured, unstructured and
semi-structured data ingestion, using Big Data ecosystem components like Sqoop,
Flume, etc. Also provides real-time data ingestion and streaming using an Apache Kafka and
Storm based scalable ingestion and processing pipeline.
Storage Types
Configurable data ingestion pipeline that dynamically chooses storage (HDFS or HBase)
based on data attributes.
Storage Layers
Ability to configure and execute data transformation and reconciliation rules using a
self-service UI. CitiusTech’s healthcare data model can be used to create a canonical data
model in the operational layer.
Data Processing
Highly configurable and easy-to-use data processing pipeline built on top of Apache
Spark to perform data validation, curation, transformation and reconciliation. The data
processing pipeline improves time-to-market for customers by quickly integrating data
from various sources.
Data Management
Data governance adapters to capture data lineage and auditing information. H-Scale
data governance adapters can be used with Apache Atlas on
Hortonworks Data Platform (HDP) and with Cloudera Navigator on
Cloudera Hadoop Distribution (CDH).
Search
Apache Solr indexing framework to index specific tables for fast search. It also provides
a tag-based logical grouping facility for searching all occurrences of specific groups.
Data Access
Apache Spark and Ignite based data virtualization platform which can connect to
different sources without replicating data. Data virtualization processes use a source
catalogue to join data at runtime without replication.
Analytics Workspace
Big Data analytics workspace that provides a self-service UI, Zeppelin based notebooks
and tools for creating data processing pipelines.
CONCLUSION
As healthcare organizations worldwide begin to roll out their Big Data
strategies, they will face a number of challenges along the way. With
the right initial approach, organizations can create more robust
strategies which enable them to leverage their Big Data assets more
effectively.
Our experience with Big Data implementations puts us in a strong
position to define and articulate best practices for healthcare Big Data
implementation. CitiusTech’s H-Scale platform for healthcare data
management has been aligned to fit seamlessly with the healthcare
industry’s Big Data implementation needs.
REFERENCES
▪ https://atlas.apache.org/
▪ https://www.redoxengine.com/blog/how-to-do-microservice-chassis-and-microservice-scaffolding-on-a-budget-2/
ABOUT THE AUTHORS
Pawan Mathur
Senior Technical Specialist – Data Management Proficiency, CitiusTech
Pawan.mathur@citiustech.com
Pawan has 20+ years of experience in the IT industry. He has extensive experience in software
development using Big Data technologies (Flink, Spark, Hadoop) and analytics. He has played the role of Senior
Architect in the development and implementation of CitiusTech’s H-Scale platform. He holds a
degree in Software Enterprise Management from the Indian Institute of Management, Bangalore.
Swanand Prabhutendolkar
Vice President – Data Science Proficiency, CitiusTech
Swanand.Prabhutendolkar@citiustech.com
Swanand leads the Data Management Proficiency at CitiusTech, which includes the Healthcare
Interoperability, BI-DW and Big Data practices. He has 20+ years of experience in the IT industry,
of which 11+ years are in healthcare analytics and data management. Prior to CitiusTech,
Swanand worked at leading technology organizations such as EPIC Corporation, Polaris and 3i
Infotech. He holds a Master of Science degree in Information Technology and Applied Statistics
from the Indian Institute of Technology (IIT), Bombay.
CitiusTech is a specialist provider of healthcare technology services and
solutions to healthcare technology companies, providers, payers and life
sciences organizations. With over 3,200 professionals worldwide,
CitiusTech enables healthcare organizations to drive clinical value chain
excellence - across integration & interoperability, data management
(EDW, Big Data), performance management (BI / analytics), predictive
analytics & data science and digital engagement (mobile, IoT).
CitiusTech helps customers accelerate innovation in healthcare through
specialized solutions, healthcare technology platforms, proficiencies and
accelerators. With cutting-edge technology expertise, world-class service
quality and a global resource base, CitiusTech consistently delivers best-
in-class solutions and an unmatched cost advantage to healthcare
organizations worldwide.
For queries contact thoughtleaders@citiustech.com
Copyright © CitiusTech 2018. All Rights Reserved.

More Related Content

What's hot

5 Shades of Analytics - Presentation Version - Distributable Version
5 Shades of Analytics - Presentation Version - Distributable Version5 Shades of Analytics - Presentation Version - Distributable Version
5 Shades of Analytics - Presentation Version - Distributable VersionMichael Josephs
 
Få overblik over IT/OT-systemer og opgraderingsbehov, Leif Poulsen - NNE Phar...
Få overblik over IT/OT-systemer og opgraderingsbehov, Leif Poulsen - NNE Phar...Få overblik over IT/OT-systemer og opgraderingsbehov, Leif Poulsen - NNE Phar...
Få overblik over IT/OT-systemer og opgraderingsbehov, Leif Poulsen - NNE Phar...Mediehuset Ingeniøren Live
 
Migrating from Oracle AERS to Argus Safety: Reasons for the Move
Migrating from Oracle AERS to Argus Safety: Reasons for the MoveMigrating from Oracle AERS to Argus Safety: Reasons for the Move
Migrating from Oracle AERS to Argus Safety: Reasons for the MovePerficient, Inc.
 
Blockchain Applications in Healthcare
Blockchain Applications in HealthcareBlockchain Applications in Healthcare
Blockchain Applications in HealthcareCitiusTech
 
How to Migrate Drug Safety and Pharmacovigilance Data Cost-Effectively and wi...
How to Migrate Drug Safety and Pharmacovigilance Data Cost-Effectively and wi...How to Migrate Drug Safety and Pharmacovigilance Data Cost-Effectively and wi...
How to Migrate Drug Safety and Pharmacovigilance Data Cost-Effectively and wi...Perficient
 
Integrating Oracle Argus Safety with other Clinical Systems Using Argus Inter...
Integrating Oracle Argus Safety with other Clinical Systems Using Argus Inter...Integrating Oracle Argus Safety with other Clinical Systems Using Argus Inter...
Integrating Oracle Argus Safety with other Clinical Systems Using Argus Inter...Perficient
 
From Disaster to Recovery: Preparing Your IT for the Unexpected
From Disaster to Recovery: Preparing Your IT for the UnexpectedFrom Disaster to Recovery: Preparing Your IT for the Unexpected
From Disaster to Recovery: Preparing Your IT for the UnexpectedDataCore Software
 
Solvency II Data Management Handbook
Solvency II Data Management HandbookSolvency II Data Management Handbook
Solvency II Data Management HandbookConor Coughlan
 
Clinical Trial Management System Implementation Guide
Clinical Trial Management System Implementation GuideClinical Trial Management System Implementation Guide
Clinical Trial Management System Implementation GuidePerficient, Inc.
 
Automating Patient Management with ApplicationXtender Workflow
Automating Patient Management with ApplicationXtender WorkflowAutomating Patient Management with ApplicationXtender Workflow
Automating Patient Management with ApplicationXtender WorkflowChristopher Wynder
 
Ensuring document control for healthcare vendors
Ensuring document control for healthcare vendorsEnsuring document control for healthcare vendors
Ensuring document control for healthcare vendorsChristopher Wynder
 
Paetec Data Center Colocation Presentation
Paetec Data Center Colocation PresentationPaetec Data Center Colocation Presentation
Paetec Data Center Colocation Presentationtbunten
 
Data Security Service Offering-v3
Data Security Service Offering-v3Data Security Service Offering-v3
Data Security Service Offering-v3Abe Newton
 
NetFlow Monitoring Standard Content Guide for ESM 6.5c
NetFlow Monitoring Standard Content Guide for ESM 6.5c	NetFlow Monitoring Standard Content Guide for ESM 6.5c
NetFlow Monitoring Standard Content Guide for ESM 6.5c Protect724migration
 
7p EHR Presentation
7p EHR Presentation7p EHR Presentation
7p EHR PresentationHunt Russell
 
Diaspark Healthcare Technology Services
Diaspark Healthcare Technology ServicesDiaspark Healthcare Technology Services
Diaspark Healthcare Technology ServicesDiaspark
 
intel_soae-h_data_sheet
intel_soae-h_data_sheetintel_soae-h_data_sheet
intel_soae-h_data_sheetAlan Boucher
 

What's hot (20)

5 Shades of Analytics - Presentation Version - Distributable Version
5 Shades of Analytics - Presentation Version - Distributable Version5 Shades of Analytics - Presentation Version - Distributable Version
5 Shades of Analytics - Presentation Version - Distributable Version
 
Få overblik over IT/OT-systemer og opgraderingsbehov, Leif Poulsen - NNE Phar...
Få overblik over IT/OT-systemer og opgraderingsbehov, Leif Poulsen - NNE Phar...Få overblik over IT/OT-systemer og opgraderingsbehov, Leif Poulsen - NNE Phar...
Få overblik over IT/OT-systemer og opgraderingsbehov, Leif Poulsen - NNE Phar...
 
Migrating from Oracle AERS to Argus Safety: Reasons for the Move
Migrating from Oracle AERS to Argus Safety: Reasons for the MoveMigrating from Oracle AERS to Argus Safety: Reasons for the Move
Migrating from Oracle AERS to Argus Safety: Reasons for the Move
 
Blockchain Applications in Healthcare
Blockchain Applications in HealthcareBlockchain Applications in Healthcare
Blockchain Applications in Healthcare
 
How to Migrate Drug Safety and Pharmacovigilance Data Cost-Effectively and wi...
How to Migrate Drug Safety and Pharmacovigilance Data Cost-Effectively and wi...How to Migrate Drug Safety and Pharmacovigilance Data Cost-Effectively and wi...
How to Migrate Drug Safety and Pharmacovigilance Data Cost-Effectively and wi...
 
Integrating Oracle Argus Safety with other Clinical Systems Using Argus Inter...
Integrating Oracle Argus Safety with other Clinical Systems Using Argus Inter...Integrating Oracle Argus Safety with other Clinical Systems Using Argus Inter...
Integrating Oracle Argus Safety with other Clinical Systems Using Argus Inter...
 
From Disaster to Recovery: Preparing Your IT for the Unexpected
From Disaster to Recovery: Preparing Your IT for the UnexpectedFrom Disaster to Recovery: Preparing Your IT for the Unexpected
From Disaster to Recovery: Preparing Your IT for the Unexpected
 
Solvency II Data Management Handbook
Solvency II Data Management HandbookSolvency II Data Management Handbook
Solvency II Data Management Handbook
 
Clinical Trial Management System Implementation Guide
Clinical Trial Management System Implementation GuideClinical Trial Management System Implementation Guide
Clinical Trial Management System Implementation Guide
 
Automating Patient Management with ApplicationXtender Workflow
Automating Patient Management with ApplicationXtender WorkflowAutomating Patient Management with ApplicationXtender Workflow
Automating Patient Management with ApplicationXtender Workflow
 
Ensuring document control for healthcare vendors
Ensuring document control for healthcare vendorsEnsuring document control for healthcare vendors
Ensuring document control for healthcare vendors
 
Paetec Data Center Colocation Presentation
Paetec Data Center Colocation PresentationPaetec Data Center Colocation Presentation
Paetec Data Center Colocation Presentation
 
Data Security Service Offering-v3
Data Security Service Offering-v3Data Security Service Offering-v3
Data Security Service Offering-v3
 
Ibm and zato health
Ibm and zato healthIbm and zato health
Ibm and zato health
 
Health IT Services
Health IT ServicesHealth IT Services
Health IT Services
 
MTech- Viva_Voce
MTech- Viva_VoceMTech- Viva_Voce
MTech- Viva_Voce
 
NetFlow Monitoring Standard Content Guide for ESM 6.5c
NetFlow Monitoring Standard Content Guide for ESM 6.5c	NetFlow Monitoring Standard Content Guide for ESM 6.5c
NetFlow Monitoring Standard Content Guide for ESM 6.5c
 
7p EHR Presentation
7p EHR Presentation7p EHR Presentation
7p EHR Presentation
 
Diaspark Healthcare Technology Services
Diaspark Healthcare Technology ServicesDiaspark Healthcare Technology Services
Diaspark Healthcare Technology Services
 
intel_soae-h_data_sheet
intel_soae-h_data_sheetintel_soae-h_data_sheet
intel_soae-h_data_sheet
 

Similar to 8 Guiding Principles to Kickstart Your Healthcare Big Data Project

Data warehouse
Data warehouseData warehouse
Data warehouseRajThakuri
 
Stream Meets Batch for Smarter Analytics- Impetus White Paper
Stream Meets Batch for Smarter Analytics- Impetus White PaperStream Meets Batch for Smarter Analytics- Impetus White Paper
Stream Meets Batch for Smarter Analytics- Impetus White PaperImpetus Technologies
 
Decoding the Role of a Data Engineer.pdf
Decoding the Role of a Data Engineer.pdfDecoding the Role of a Data Engineer.pdf
Decoding the Role of a Data Engineer.pdfDatavalley.ai
 
Key aspects of big data storage and its architecture
Key aspects of big data storage and its architectureKey aspects of big data storage and its architecture
Key aspects of big data storage and its architectureRahul Chaturvedi
 
Big Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewBig Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewIRJET Journal
 
Got data?… now what? An introduction to modern data platforms
Got data?… now what?  An introduction to modern data platformsGot data?… now what?  An introduction to modern data platforms
Got data?… now what? An introduction to modern data platformsJamesAnderson599331
 
Chapter 2 - Intro to Data Sciences[2].pptx
Chapter 2 - Intro to Data Sciences[2].pptxChapter 2 - Intro to Data Sciences[2].pptx
Chapter 2 - Intro to Data Sciences[2].pptxJethroDignadice2
 
IT for Management On-Demand Strategies for Performance, Growth,.docx
IT for Management On-Demand Strategies for Performance, Growth,.docxIT for Management On-Demand Strategies for Performance, Growth,.docx
IT for Management On-Demand Strategies for Performance, Growth,.docxvrickens
 
Enterprise Data Lake
Enterprise Data LakeEnterprise Data Lake
Enterprise Data Lakesambiswal
 
Enterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable DigitalEnterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable Digitalsambiswal
 
An Overview of Data Lake
An Overview of Data LakeAn Overview of Data Lake
An Overview of Data LakeIRJET Journal
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundationshktripathy
 
Security issues in big data
Security issues in big data Security issues in big data
Security issues in big data Shallote Dsouza
 
5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data Lake5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data LakeMetroStar
 

Similar to 8 Guiding Principles to Kickstart Your Healthcare Big Data Project (20)

Data warehouse
Data warehouseData warehouse
Data warehouse
 
Stream Meets Batch for Smarter Analytics- Impetus White Paper
Stream Meets Batch for Smarter Analytics- Impetus White PaperStream Meets Batch for Smarter Analytics- Impetus White Paper
Stream Meets Batch for Smarter Analytics- Impetus White Paper
 
U - 2 Emerging.pptx
U - 2 Emerging.pptxU - 2 Emerging.pptx
U - 2 Emerging.pptx
 
Benefits of a data lake
Benefits of a data lake Benefits of a data lake
Benefits of a data lake
 
Big data and oracle
Big data and oracleBig data and oracle
Big data and oracle
 
Unit 5
Unit 5 Unit 5
Unit 5
 
BD1.pptx
BD1.pptxBD1.pptx
BD1.pptx
 
Decoding the Role of a Data Engineer.pdf
Decoding the Role of a Data Engineer.pdfDecoding the Role of a Data Engineer.pdf
Decoding the Role of a Data Engineer.pdf
 
Key aspects of big data storage and its architecture
Key aspects of big data storage and its architectureKey aspects of big data storage and its architecture
Key aspects of big data storage and its architecture
 
Data Mining
Data MiningData Mining
Data Mining
 
Big Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A ReviewBig Data Processing with Hadoop : A Review
Big Data Processing with Hadoop : A Review
 
Got data?… now what? An introduction to modern data platforms
Got data?… now what?  An introduction to modern data platformsGot data?… now what?  An introduction to modern data platforms
Got data?… now what? An introduction to modern data platforms
 
Chapter 2 - Intro to Data Sciences[2].pptx
Chapter 2 - Intro to Data Sciences[2].pptxChapter 2 - Intro to Data Sciences[2].pptx
Chapter 2 - Intro to Data Sciences[2].pptx
 
IT for Management On-Demand Strategies for Performance, Growth,.docx
IT for Management On-Demand Strategies for Performance, Growth,.docxIT for Management On-Demand Strategies for Performance, Growth,.docx
IT for Management On-Demand Strategies for Performance, Growth,.docx
 
Enterprise Data Lake
Enterprise Data LakeEnterprise Data Lake
Enterprise Data Lake
 
Enterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable DigitalEnterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable Digital
 
An Overview of Data Lake
An Overview of Data LakeAn Overview of Data Lake
An Overview of Data Lake
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundations
 
Security issues in big data
Security issues in big data Security issues in big data
Security issues in big data
 
5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data Lake5 Steps for Architecting a Data Lake
5 Steps for Architecting a Data Lake
 





8 Guiding Principles to Kickstart Your Healthcare Big Data Project

  • 1. White Paper, December 2018: 8 Guiding Principles to Kickstart Your Healthcare Big Data Project
  • 2. OVERVIEW OF BIG DATA IN HEALTHCARE. Big Data technologies have seen widespread adoption across different industries over the past 3-5 years, but the healthcare industry is just starting to realize the benefits. This is mainly due to the exponential growth of unstructured and semi-structured healthcare information. With sensors and wearables becoming a part of our daily lives, people and organizations now have access to enormous amounts of data, e.g., step tracking, heartbeat / blood pressure monitoring, calorie tracking, sleep pattern analysis, etc. The explosion in healthcare data, while posing massive storage and processing challenges, also has the potential to transform the way we use data to improve outcomes, for example: predicting future care needs for specific populations; minimizing health risks by predicting specific events well in advance; and identifying, or expediting the identification of, new patterns in disease detection. Our experience with a large number of healthcare Big Data projects has shown that most customers face significant hurdles in kick-starting their Big Data initiatives. With limited or no experience, customers often realize at the last minute that their Big Data project implementations don’t have the architectural robustness to address future needs. This white paper illustrates our experiences and learnings across multiple Big Data implementation projects. It contains a broad set of guidelines and best practices around: building highly secure Big Data lakes; efficiently processing vast amounts of data; providing access to downstream systems; mitigating project risks; and technical hurdles and approaches to overcome them.
  • 3. GUIDING PRINCIPLES FOR BIG DATA IMPLEMENTATION. 1. Use a Comprehensive Data Ingestion Framework. While working with a Big Data lake, you need to integrate numerous source systems with multiple feed types. Your Big Data solution should be able to handle different feed types and cater to future source system integration needs. Design a data ingestion framework that addresses: all types of data (relational, semi-structured and unstructured); standard feed protocols (HTTPS, SFTP, etc.); different loading scenarios, including initial load and incremental load; an ELT (Extract, Load, Transform) approach, as compared to traditional ETL; various ingestion frequencies (batch, real-time); and relevant data ingestion mechanisms (push or pull). Pulling data may not be preferable when the data lake is in the cloud and the sources reside on-premises. [Figure: Big Data Layers: Typical Architecture. Labels: Data Sources, Data Ingestion Layer, Data Storage, S3, HDFS Cluster, Data Processing, Batch Processing, Real-time Processing, Data Query Layer, Querying & Analytics Engine, Data Visualization Layer, Statistical Analytics, Semantic Analytics, Predictive Modelling, Dashboards & Reports, Data Monitoring Layer, Data Security Layer.]
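The ingestion dimensions listed under principle 1 can be captured as a per-feed configuration record. The following is only a minimal sketch in Python: the class names, fields, and the two-protocol allow-list are illustrative assumptions, not part of the white paper or of any specific product.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class DataType(Enum):
    RELATIONAL = "relational"
    SEMI_STRUCTURED = "semi-structured"
    UNSTRUCTURED = "unstructured"


class LoadType(Enum):
    INITIAL = "initial"
    INCREMENTAL = "incremental"


class Frequency(Enum):
    BATCH = "batch"
    REAL_TIME = "real-time"


class Mechanism(Enum):
    PUSH = "push"
    PULL = "pull"


# Hypothetical allow-list; the paper names HTTPS and SFTP as examples only.
ALLOWED_PROTOCOLS = {"HTTPS", "SFTP"}


@dataclass(frozen=True)
class FeedConfig:
    """Describes one source feed along the dimensions listed in principle 1."""
    name: str
    data_type: DataType
    protocol: str
    load_type: LoadType
    frequency: Frequency
    mechanism: Mechanism
    source_on_premises: bool = False
    lake_in_cloud: bool = False

    def warnings(self) -> List[str]:
        """Return human-readable warnings for questionable settings."""
        issues = []
        if self.protocol.upper() not in ALLOWED_PROTOCOLS:
            issues.append(f"{self.name}: protocol {self.protocol!r} is not in the allow-list")
        # The paper notes that pull may not be preferable for cloud lakes
        # with on-premises sources.
        if (self.mechanism is Mechanism.PULL
                and self.source_on_premises and self.lake_in_cloud):
            issues.append(f"{self.name}: pulling from an on-premises source "
                          "into a cloud lake; consider push")
        return issues
```

A framework built this way could validate every new feed definition before onboarding it, so that future source systems are added by configuration rather than by new code.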
  • 4. 2. Choose the Right Storage Type for Each Feed. Because Big Data ecosystems provide multiple storage components, you can use the most suitable storage type for each feed. The following points need to be considered while choosing a storage type. Feed attributes: e.g., total size of data, size of individual files, velocity at which data arrives, etc. Data ingestion system: ability to identify whether the data ingested is small or big in size. Database architecture: based on size, data can be stored in distributed file systems, cloud storage or NoSQL / columnar databases. For example, files of 128MB and above (the default Hadoop block size) can be stored in HDFS, small files (in KBs) can be stored in Hadoop sequence files or in HBase, and JSON data can be stored in a document database. 3. Create Separate Storage Layers. Organizations starting their Big Data implementations often ask, “How do we arrange data in a Data Lake?” and “How many layers should we create?”. The answers depend on the type of data being pulled and processed in the Data Lake. In a standard scenario, customers want to correlate data from relational systems, IoT devices, social media and unstructured data sources, e.g., notes, images, documents, etc. In such scenarios, a three-layer approach can be used. Raw Layer: Although not mandatory, it is always advisable to store data in its native form in the Data Lake. This forms the raw layer or raw zone of the Data Lake. The raw layer is generally used by data scientists or analysts to perform analysis instead of waiting for operational data. Curated Layer: While the raw layer is important from a raw analytics and reprocessing perspective, it isn’t the most
  • 5. optimal way to store data, as it may contain duplicate, incorrect or incomplete records. It is always advisable to create a curated data layer that has cleansed and standardized data. Analytics performed on the curated layer provide much more accurate results than analytics on the raw layer. Operational Layer: Data stored in the curated layer isn’t reconciled and continues to have the context of the source system. This poses analytics challenges and also allows duplicate records to be sourced. The operational layer solves this problem by reconciling and transforming incoming data from different sources into a single, canonical model. 4. Use the Right Data Processing Frameworks & Tools. Identifying the right data processing framework can be difficult, as there are multiple processing frameworks in the Big Data ecosystem. Common data processing tasks like data cleansing, quality reporting, aggregation, transformation and reconciliation can be performed by standard ETL tools. However, for Big Data processing, most standard ETL tools use Apache Spark. While these ETL tools provide a drag-and-drop UI and out-of-the-box adapters, their internal working is abstracted, making them difficult to operate in certain scenarios. Commonly used ETL tools are Talend Enterprise, Pentaho, Informatica, DataStage and Attunity. For simple data processing needs, IT teams can create a custom ETL utility using Apache Spark and its built-in transformation functions. The following best practices need to be kept in mind while working on data processing frameworks. Big Data processing happens in a distributed manner, so it is necessary to arrange data to minimize shuffling and optimize performance. Use compression to speed up data transfer over the network and reduce shuffling time.
  • 6. Joins are expensive in Big Data and should be thoughtfully implemented. You can also improve performance by de-normalizing records. Use parameters like batch id, date range or specific set to overcome bad / corrupt data issues. Keep track of events (metadata, audits) during data processing, e.g., who triggered the process, which dataset was used for processing, size of the dataset, count of records processed, status of processing, start and finish time, etc. Be practical with partitioning: distributed processing often fails to take full advantage of the nodes due to small or numerous partitions. For stream processing, create enough partitions on a Kafka topic to trigger parallel processing in Apache Spark, and provide checkpoints at regular intervals to minimize stream processing failure. 5. Think of Data Management Right at the Beginning. With business environments changing rapidly, organizations need to consider data management as a critical component of their business strategy. The organization’s data strategy is affected by multiple scenarios, including: changes in organization or technology; process and people changes due to mergers and acquisitions; changes in regulatory compliance or contractual arrangements; issues with quality / availability / timeliness of data that affect decision making; and massive investments in time and resources required to get data into the correct shape. To overcome these challenges, organizations must start thinking of data management solutions right from project inception. A few frameworks provide data management capabilities for Big Data, e.g., Apache Atlas with Apache Falcon for Hortonworks; Cloudera Navigator, which has partial functionality; and MapR, which uses a custom framework.
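The event-tracking practice listed under principle 4 (who triggered the run, which dataset, its size, record count, status, start and finish time) can be sketched as a small audit wrapper. This is an illustrative Python sketch only: `ProcessingEvent`, `audited_run`, and the in-memory `AUDIT_LOG` are hypothetical names, and a real pipeline would persist these records rather than keep them in a list.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class ProcessingEvent:
    """One audit record per processing run."""
    triggered_by: str
    dataset: str
    batch_id: str            # lets a bad or corrupt batch be re-run in isolation
    size_bytes: int
    records_processed: int = 0
    status: str = "RUNNING"
    started_at: float = field(default_factory=time.time)
    finished_at: Optional[float] = None


# Hypothetical in-memory audit store, used only for illustration.
AUDIT_LOG: List[ProcessingEvent] = []


@contextmanager
def audited_run(triggered_by: str, dataset: str, batch_id: str,
                size_bytes: int) -> Iterator[ProcessingEvent]:
    """Record status and timings for a processing step, even if it fails."""
    event = ProcessingEvent(triggered_by, dataset, batch_id, size_bytes)
    try:
        yield event
        event.status = "SUCCEEDED"
    except Exception:
        event.status = "FAILED"
        raise
    finally:
        event.finished_at = time.time()
        AUDIT_LOG.append(event)
```

A processing step would be wrapped as `with audited_run("scheduler", "claims_raw", "batch-001", 1024) as run:` and would set `run.records_processed` before leaving the block. Because the batch id is recorded with every run, a failed batch can be identified and reprocessed on its own.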
  • 7. 7 Pillars of Data Management. 1. Data Architecture: Data analysis, enterprise data architecture, integration with applications. 2. Content Management: Organizing, consolidating and optimizing content. 3. Data Development: Requirement analysis, data modelling, database design, implementation and maintenance. 4. Master Data and Metadata Management: Master patient index, master provider index, master facility index, ICD 9/10, CPT, SNOMED, LOINC, DRG and standards, common codes, integration metadata, control metadata, quality metadata. 5. Data Quality: Measurement, assessment and improvement in data quality. 6. Operations Management: Acquisition, recovery, tuning, retention and purging. 7. Data Security: Classification, administration, privacy and confidentiality, authentication and auditing. 6. Provide a Sophisticated Search Capability. The search feature becomes essential to Big Data systems due to high volumes of data. Searching for specific attribute values is like finding a needle in a haystack. As entities are added to, updated in or removed from the Data Lake, there must be a way to quickly get a view of the entities present and to search for specific attribute values. It's always beneficial to index your data and provide a search UI for quick discovery. Consider providing a facility to tag attributes to make them searchable and allow users to group attributes using tags. 7. Simplify Data Access Using APIs and Data Virtualization. All data warehousing / Data Lake projects need to provide data extracts to downstream / external systems, allow users to search data, and enable analytics systems to connect and analyze data using
  • 8. standard interfaces. Most of these requirements can be fulfilled by a thin API access layer that provides unified access to the underlying data. The API layer implementation should support standards-based interfaces like REST, SQL or a combination of both. Data extraction processes are scheduled jobs that extract data from specific tables and store it in a shared location (e.g., SFTP). A low-priority processing queue can be used for data extraction during peak hours to ensure the extraction query does not consume all processing resources. Additionally, data virtualization software (e.g., Denodo) or a custom data virtualization layer (using Apache Ignite and Spark) can be used to create a common interface for Data Lakes and other source systems. 8. Provide an Analytics Workspace for Advanced Users. With the evolution of Big Data and Data Lakes, more organizations are adopting advanced analytics tools and technologies, e.g., Predictive Analytics, Machine Learning, Deep Learning, Natural Language Processing and AI algorithms. These technologies require extensive piloting, model operationalization and custom dashboarding before they can be applied in real-world scenarios. Data scientists and analysts need a dedicated workspace and their preferred toolsets to pull, process and analyze raw, curated and aggregated data, and to share their findings. They should be able to perform activities like preliminary analysis, identifying new trends and quick dashboarding, without affecting the Data Lake. An analytics workspace can be implemented in one of the following ways: A. Use Existing Data Lake Infrastructure to Carve Out Space for Individual Data Scientists: This option uses the existing Data Lake infrastructure to create slots for individual data scientists where they can work with a copy of the data using various tools, e.g., Apache Spark based notebooks. B. 
Use a Separate Cluster for Each Data Scientist: This option creates separate infrastructure for individual users and pulls data from the Data Lake. This option may prove costlier but provides a true multitenant architecture and ensures that the system performance is always optimal.
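The low-priority extraction queue described under principle 7 can be illustrated with a small priority-ordered job queue. This is a sketch under stated assumptions: the peak window of 08:00 to 18:00, the priority values, and the `JobQueue` class are all hypothetical. In a real cluster this role would more likely be played by the resource scheduler's own queues than by application code.

```python
import heapq
import itertools
from datetime import time as dtime

# Assumed peak window; the paper does not define one.
PEAK_START, PEAK_END = dtime(8, 0), dtime(18, 0)

NORMAL, LOW = 0, 10  # smaller number is served first


class JobQueue:
    """Serves jobs by priority, then by submission order."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order

    def submit(self, name, is_extraction, submitted_at):
        in_peak = PEAK_START <= submitted_at < PEAK_END
        # Extraction jobs are demoted only while peak hours are in effect.
        priority = LOW if (is_extraction and in_peak) else NORMAL
        heapq.heappush(self._heap, (priority, next(self._order), name))

    def next_job(self):
        """Return the name of the next job to run, or None if empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None
```

With this ordering, an extraction job submitted at 10:00 waits behind an interactive query submitted a minute later, while the same extraction job submitted at 22:00 runs in plain arrival order.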
  • 9. H-SCALE ADDRESSES KEY HEALTHCARE BIG DATA NEEDS. CitiusTech’s H-Scale platform for healthcare data management has been specifically designed to address healthcare Big Data challenges such as data acquisition, real-time processing, Master Data Management, data security and advanced analytics. Here is how H-Scale supports the Big Data requirements discussed in this paper. Data Ingestion: Highly configurable data ingestion pipeline that caters to structured, unstructured and semi-structured data ingestion, using Big Data ecosystem components like Sqoop, Flume, etc. It also provides real-time streaming ingestion through a scalable, combined ingestion and processing pipeline based on Apache Kafka and Storm. Storage Types: Configurable data ingestion pipeline that dynamically chooses storage (HDFS or HBase) based on data attributes. Storage Layers: Ability to configure and execute data transformation and reconciliation rules using a self-service UI. CitiusTech’s healthcare data model can be used to create the canonical data model in the operational layer. Data Processing: Highly configurable and easy-to-use data processing pipeline built on top of Apache Spark to perform data validation, curation, transformation and reconciliation. The data processing pipeline improves time-to-market for customers by quickly integrating data from various sources.
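Choosing storage from data attributes, as described under principle 2, can be sketched as a simple routing function. This is not H-Scale's actual logic. It encodes only the paper's rules of thumb (files at or above the 128MB default Hadoop block size go to HDFS, small files go to sequence files or HBase, JSON goes to a document database). Two points are assumptions made for illustration: every file below the block size is treated as small, and the JSON check takes precedence over the size check. The returned labels are hypothetical.

```python
HDFS_BLOCK_BYTES = 128 * 1024 * 1024  # default Hadoop block size cited in the paper


def choose_storage(file_size_bytes: int, file_format: str) -> str:
    """Return an illustrative storage label for one incoming file."""
    # Assumption: JSON is routed by format before size is considered.
    if file_format.lower() == "json":
        return "document-database"
    if file_size_bytes >= HDFS_BLOCK_BYTES:
        return "hdfs"
    # Assumption: anything below the block size counts as a small file.
    return "hbase-or-sequence-file"
```

For example, `choose_storage(200 * 1024 * 1024, "csv")` routes a 200MB CSV file to HDFS, while a 4KB CSV file is routed to the small-file store.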
  • 10. Data Management: Data governance adapters to capture data lineage and auditing information. H-Scale data governance adapters can be used with Apache Atlas on Hortonworks Data Platform (HDP) and with Cloudera Navigator on Cloudera Hadoop Distribution (CDH). Search: Apache Solr indexing framework to index specific tables for fast search. It also provides a tag-based logical grouping facility for searching all occurrences of specific groups. Data Access: Apache Spark and Ignite based data virtualization platform that can connect to different sources without replicating data. Data virtualization processes use the source catalogue to join data at runtime without replication. Analytics Workspace: Big Data analytics workspace that provides a self-service UI, a Zeppelin based notebook and tools for creating data processing pipelines.
  • 11. CONCLUSION. As healthcare organizations worldwide begin to roll out their Big Data strategies, they will face a number of challenges along the way. With the right initial approach, organizations can create more robust strategies which enable them to leverage their Big Data assets more effectively. Our experience with Big Data implementations puts us in a strong position to define and articulate best practices for healthcare Big Data implementation. CitiusTech’s H-Scale platform for healthcare data management has been aligned to fit seamlessly with the healthcare industry’s Big Data implementation needs. REFERENCES: https://atlas.apache.org/ ; https://www.redoxengine.com/blog/how-to-do-microservice-chassis-and-microservice-scaffolding-on-a-budget-2/
  • 12. ABOUT THE AUTHORS. Pawan Mathur, Senior Technical Specialist – Data Management Proficiency, CitiusTech (Pawan.mathur@citiustech.com). Pawan has 20+ years of experience in the IT industry. He has extensive experience in software development using Big Data technologies (Flink, Spark, Hadoop) and analytics. He has played the role of Senior Architect in the development and implementation of CitiusTech’s H-Scale platform. He holds a degree in Software Enterprise Management from the Indian Institute of Management, Bangalore. Swanand Prabhutendolkar, Vice President – Data Science Proficiency, CitiusTech (Swanand.Prabhutendolkar@citiustech.com). Swanand leads the Data Management Proficiency at CitiusTech, which includes the Healthcare Interoperability, BI-DW and Big Data practices. He has 20+ years of experience in the IT industry, of which 11+ years are in healthcare analytics and data management. Prior to CitiusTech, Swanand served leading technology organizations such as EPIC Corporation, Polaris and 3i Infotech. He holds a Master of Science degree in Information Technology and Applied Statistics from the Indian Institute of Technology (IIT), Bombay.
  • 13. CitiusTech is a specialist provider of healthcare technology services and solutions to healthcare technology companies, providers, payers and life sciences organizations. With over 3,200 professionals worldwide, CitiusTech enables healthcare organizations to drive clinical value chain excellence across integration & interoperability, data management (EDW, Big Data), performance management (BI / analytics), predictive analytics & data science and digital engagement (mobile, IoT). CitiusTech helps customers accelerate innovation in healthcare through specialized solutions, healthcare technology platforms, proficiencies and accelerators. With cutting-edge technology expertise, world-class service quality and a global resource base, CitiusTech consistently delivers best-in-class solutions and an unmatched cost advantage to healthcare organizations worldwide. For queries contact thoughtleaders@citiustech.com. Copyright © CitiusTech 2018. All Rights Reserved.