Redefining ETL Pipelines with Apache
Technologies to Accelerate Decision
Making for Clinical Trials
Eran Withana
www.comprehend.com
Overview
Clinical Trials – Lay of the land
Business and Technical Requirements
Technology Evaluation
High Level Architecture
Implementation
Managing Hardware
Deployments
Data Adapters: Implementation and Failure Modes
Distributed File System
Challenges
Future Work
About me …
Open Source
• Member, PMC member and committer of ASF
• Apache Axis2, Web Services, Synapse, Airavata
Education
• PhD in Computer Science from Indiana University
Software engineer at Comprehend Systems
Clinical Trials – Lay of the land
[Chart: Number of Drugs in Development Worldwide (Source: CenterWatch Drugs in Clinical Trial Database 2014)]
Source: http://www.phrma.org/innovation/clinical-trials
Clinical Trials – Lay of the Land
Multiple Stakeholders
• Study Managers
• Program Managers
• Monitors
• Data Managers
• Bio-statisticians
• Executives
• Medical Affairs
• Regulatory
• Vendors
• CROs
• CRAs
[Diagram labels: Sites, Labs, Patients; Safety, EDC, PV Data, Excel; Reports; Data: Latent, Fragmented; Sponsor; Contract Research Organization (CRO); Sites and Investigators]
For decades, clinical development
was primarily paper-based.
Various Software and Practices Used in Each Layer
[Diagram labels: Medidata; CROs and SIs; Technologies]
Clinical Trials with Centralized Monitoring
[Diagram labels: Clinical Operations; Sites, Labs, Patients; Data: Safety, EDC, PV Data, Excel; Clinical Analytics & Collaboration: Consolidated, Real-time, Self-Service, Mobile]
Providing up-to-date answers
[Diagram labels: Executives, Medical Review, CRAs, Data Management, Clinical Operations; EDC, CTMS, Safety, ePro, Other; Web, Ad-Hoc, Mobile, Collab]
Business Requirements
• FDA, HIPAA compliance
• Metadata/database structure synchronization
  - Less frequent (once a day)
• Data synchronization
  - More frequent (multiple times a day)
• Ability to plug in various data sources
  - RAVE, MERGE, BioClinica, file imports, DB-to-DB syncs
• Real-time event propagation
  - Adverse events (AEs): the need for early identification
Technical Requirements
• Hardware agnostic for resiliency and better utilization
• Repeatable deployments
• Real-time processing and real-time events
• Fault tolerance
• In-flight and end-state metrics for alerting and monitoring
• Flexible and pluggable adapter architecture
• Time travel
• Audit trails
• Report generation
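As a rough illustration of the in-flight and end-state metrics requirement, the sketch below formats metrics in the StatsD line protocol (`<name>:<value>|<type>`) and sends them over UDP. It assumes StatsD, one of the monitoring tools listed in the evaluation, is the collector; the metric names, the `statsd_line` and `send_metric` helpers, and the host are hypothetical, and 8125 is StatsD's default port.

```python
import socket


def statsd_line(name: str, value: float, metric_type: str) -> str:
    """Format one metric in the StatsD line protocol: <name>:<value>|<type>."""
    if metric_type not in {"c", "g", "ms"}:  # counter, gauge, timer
        raise ValueError(f"unsupported metric type: {metric_type}")
    return f"{name}:{value}|{metric_type}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    # StatsD listens on UDP, so the sender does not wait for a reply.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# In-flight metric: rows written so far by a (hypothetical) seeder job.
in_flight = statsd_line("seeder.rows_written", 500, "c")
# End-state metric: total job duration in milliseconds.
end_state = statsd_line("seeder.job_duration", 1250, "ms")
```

A job would call `send_metric(in_flight)` as it progresses and `send_metric(end_state)` once on completion, so alerting can watch both the running and the final numbers.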
Core Design Principles
• Events all the way
• Shared event bus for multiple consumers
• Language-agnostic data representations (via protobuf)
• Automatic datacenter resource management (Mesos/Marathon/Docker)
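The "shared event bus for multiple consumers" principle can be sketched with an in-memory stand-in: an append-only log per topic, with each consumer tracking its own read offset. This is an illustration only, not Kafka's API; the `EventBus` class, the topic name and the event fields are hypothetical, and plain dictionaries stand in for the protobuf-encoded payloads mentioned above.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class EventBus:
    """In-memory stand-in for a shared event bus (hypothetical, not Kafka's API)."""
    log: Dict[str, List[Any]] = field(default_factory=lambda: defaultdict(list))
    offsets: Dict[tuple, int] = field(default_factory=dict)

    def publish(self, topic: str, event: Any) -> None:
        # Events are only appended, so every consumer sees the same history.
        self.log[topic].append(event)

    def poll(self, consumer: str, topic: str) -> List[Any]:
        # Each consumer keeps its own read position per topic.
        start = self.offsets.get((consumer, topic), 0)
        events = self.log[topic][start:]
        self.offsets[(consumer, topic)] = start + len(events)
        return events


bus = EventBus()
bus.publish("adverse-events", {"study": "S1", "subject": "001", "term": "headache"})
bus.publish("adverse-events", {"study": "S1", "subject": "002", "term": "nausea"})

# Two independent consumers read the same events without affecting each other.
alerts = bus.poll("alerting", "adverse-events")
reports = bus.poll("reporting", "adverse-events")
```

Because reading does not remove events, a new consumer (for example, an audit trail) can be added later and still replay the full history.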
Technologies Evaluated
• Data processing: Apache Storm and Trident, Apache Spark and Spark Streaming, Samza, Summingbird, Scalding, Apache Falcon, Azkaban
• Coordination and configuration management: Apache ZooKeeper, Redis, Apache Curator
• Event queue: Apache Kafka
• Scheduling: Chronos, Apache Mesos, Marathon, Apache Aurora
• Database synchronization: Liquibase, Flyway
• Data representations: Apache Thrift, protobuf, Avro
• Deployments: Ansible
• File management: Apache HDFS
• Monitoring and alerting: Graphite, StatsD
• Database: PostgreSQL, Apache Spark
• Resource isolation: LXC, Docker
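Data Processing Technology Evaluation

Criteria                         | Storm + Trident | Spark + Streaming | Samza | Summingbird | Scalding | Falcon | Chronos | Aurora | Azkaban
DAG Support                      | Y               | DAGScheduler      | Y     | Y           | Y        | Y      | Y       | N      | Y
DAG Nodes Resiliency             | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | N      | Y
Event Driven                     | Y               | Y                 | Y     | Y           | N        | N      | N       | N      | N
DAG Extension                    | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | Y      | Y
In-flight and end-state metrics  | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | Y      | Y
Hardware Agnostic                | Y               | Y                 | Y     | Y           | Y        | Y      | Y       | Y      | Y

Timed Execution: Y Y Y Y Y Y Y Y (only eight values survive in the source for nine columns, so this row is not aligned to the table)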
High Level Architecture
Managing Hardware
• Bare-metal boxes
• Partitioned using LXC containers
• Mesos allocates resources to jobs as needed
Deployments
• Ansible
  - Repeatable deployments
  - Password management
  - Inventory management (nodes, dev/staging/production)
Adapters – High Level
Data Adapters
• Syncher handles DB structural changes
  - Syncher creates a database schema from the source information
  - Runs a generic database diff and applies the differences to the target database
• Seeder handles data synchronization
  - Uses the database schema created by Syncher
• Seeders get jobs from
  - Syncher, or
  - Timed scheduler
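The Syncher's "generic database diff" step can be pictured with a minimal sketch that compares two schema descriptions and emits the DDL needed to bring the target in line with the source. The `diff_schemas` function, the dictionary-based schema shape and the sample tables are assumptions made for illustration; the deck does not show how the real diff is implemented (Liquibase and Flyway are listed among the tools evaluated for this job). Column type changes are ignored here.

```python
from typing import Dict, List

Schema = Dict[str, Dict[str, str]]  # table name -> {column name -> SQL type}


def diff_schemas(source: Schema, target: Schema) -> List[str]:
    """Return DDL that makes `target` match `source` (column type changes are ignored)."""
    statements: List[str] = []
    for table, columns in sorted(source.items()):
        if table not in target:
            cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
            statements.append(f"CREATE TABLE {table} ({cols})")
            continue
        # Table exists on both sides: add missing columns, drop extra ones.
        for name, sql_type in columns.items():
            if name not in target[table]:
                statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
        for name in target[table]:
            if name not in columns:
                statements.append(f"ALTER TABLE {table} DROP COLUMN {name}")
    # Tables that no longer exist in the source are dropped from the target.
    for table in sorted(target):
        if table not in source:
            statements.append(f"DROP TABLE {table}")
    return statements


source = {"subjects": {"id": "INTEGER", "site": "TEXT", "enrolled_on": "DATE"},
          "visits": {"id": "INTEGER", "subject_id": "INTEGER"}}
target = {"subjects": {"id": "INTEGER", "site": "TEXT"},
          "legacy": {"id": "INTEGER"}}

ddl = diff_schemas(source, target)
```

Running all generated statements inside one transaction would match the rollback behaviour the deck describes for schema changes that fail midway.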
Data Adapters – Coordination and Configuration
• Coordination and configuration through ZooKeeper
  - Job configuration
  - Connection information
  - Distributed locking and counters
  - Metric maintenance
  - Last successful run
Data Adapters - Implementation
Failure Modes
• Syncher
  - Connectivity to source/sink systems fails: retry the job N times and alert, if needed
  - Schema changes to the database fail midway: transaction rollback
• Seeder
  - Connectivity to source/sink systems fails: retry the job N times and alert, if needed
  - Seeding fails midway: Storm retries tuples; failing tuples are moved to an error queue
  - Table- and row-level failures: option to skip the tables/rows but send a report at the end
  - Effect on "live" tables during data synchronization: option to use transactions, or use temporary tables and swap with the originals upon completion
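The "retry N times, then move to an error queue" pattern above can be sketched in plain Python. This is not Storm's replay mechanism, which works at the tuple level inside a topology; the `process_with_retries` function, the flaky handler and the sample rows are hypothetical and only show the control flow.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def process_with_retries(
    items: Iterable[T],
    handler: Callable[[T], None],
    max_attempts: int = 3,
) -> List[Tuple[T, str]]:
    """Run `handler` on each item; items still failing after `max_attempts` go to the error queue."""
    error_queue: List[Tuple[T, str]] = []
    for item in items:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(item)
                break
            except Exception as exc:  # broad on purpose: any failure counts as a failed attempt
                if attempt == max_attempts:
                    error_queue.append((item, str(exc)))
    return error_queue


attempts = {}


def flaky_handler(row: dict) -> None:
    # Hypothetical handler: row 2 fails once and then succeeds, row 3 always fails.
    attempts[row["id"]] = attempts.get(row["id"], 0) + 1
    if row["id"] == 2 and attempts[2] < 2:
        raise ConnectionError("sink unavailable")
    if row["id"] == 3:
        raise ValueError("bad value")


rows = [{"id": 1}, {"id": 2}, {"id": 3}]
failed = process_with_retries(rows, flaky_handler, max_attempts=3)
```

The returned error queue is what an end-of-run report or alert would be built from, so one bad row does not stop the rest of the table from loading.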
What Have We Gained
• Can bring in data from more data sources and more studies effectively
• Run real-time reports on studies and configure alerts (future)
• Can configure refreshes as needed by each use case
• Can throttle input and output sources at study/customer level
• Ability to onboard new customers and deploy new studies with minimal human intervention
What Have We Gained
A generic framework which
• eases integration with new data sources
  - For each new source, implement a method to create a virtual schema and a method to get data for a given table
• scales and is fault tolerant
• has generic monitoring and alerting
• eases maintenance, since it's mostly generic code
• sends notifications of important events through messages
• runs on any hardware
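The two-method contract for new data sources might look roughly like the sketch below: an abstract adapter with one method for the virtual schema and one for the rows of a table. The class and method names are hypothetical, and the CSV-backed adapter is only a stand-in for a real source such as a file import.

```python
from abc import ABC, abstractmethod
import csv
import io
from typing import Dict, Iterator, List


class SourceAdapter(ABC):
    """Hypothetical adapter contract: one method for schema, one for data."""

    @abstractmethod
    def virtual_schema(self) -> Dict[str, List[str]]:
        """Return table name -> column names."""

    @abstractmethod
    def rows(self, table: str) -> Iterator[Dict[str, str]]:
        """Yield the rows of one table as dictionaries."""


class CsvAdapter(SourceAdapter):
    """Example source: in-memory CSV text, one entry per table."""

    def __init__(self, files: Dict[str, str]) -> None:
        self.files = files

    def virtual_schema(self) -> Dict[str, List[str]]:
        # The header row of each CSV file supplies the column names.
        return {
            table: next(csv.reader(io.StringIO(text)))
            for table, text in self.files.items()
        }

    def rows(self, table: str) -> Iterator[Dict[str, str]]:
        yield from csv.DictReader(io.StringIO(self.files[table]))


adapter = CsvAdapter({"subjects": "id,site\n1,Boston\n2,Lyon\n"})
schema = adapter.virtual_schema()
subjects = list(adapter.rows("subjects"))
```

With this shape, the generic Syncher and Seeder code only ever talks to `SourceAdapter`, which is what keeps most of the framework source-independent.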
Distributed File System - Requirements
• Accessibility
  - Customers must be able to drop files securely (SFTP-like functionality)
  - Ability to access resources through URLs
• Data storage
• Scalability and redundancy
  - Scale out by adding nodes
  - Resilience against loss of nodes and data centers; replication
• Miscellaneous
  - Access control over read/write
  - Performance/usage/resource utilization monitoring
HDFS with High Availability Mode
• Two name nodes running in HA mode, co-located with two journal nodes
• Third journal node on a separate node
• Data nodes on all bare-metal nodes
• Mounting HDFS with FUSE and enabling SFTP through OS-level features
• Automatic failover through DNS and HAProxy
Challenges
• Regulatory requirements
  - Data encryption requirements for clinical data
  - Audit trails
• Data quality
• Source system constraints
• Coordination between Synchers and Seeders
  - Distributed locks and counters
• Automatic failover when a name node fails in HDFS
  - HDFS HA mode stores the active name node in ZK as a Java serialized object, yikes!
Future Work
• Time travel
  - Ability to go back in time and run reports at any given point in time
  - Trail of data
• Containerization
• In-memory query execution with Apache Spark
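One common way to get "time travel" is to keep an append-only history table and query it "as of" a date. The sketch below shows that idea with SQLite from the Python standard library; it is an assumption about how such a feature could work, not the design the deck commits to, and the table, column names and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subject_history (subject_id TEXT, status TEXT, recorded_at TEXT)"
)
# Rows are only ever appended, which also leaves a trail of data.
conn.executemany(
    "INSERT INTO subject_history VALUES (?, ?, ?)",
    [
        ("001", "screened", "2015-01-05"),
        ("001", "enrolled", "2015-02-10"),
        ("002", "screened", "2015-02-01"),
        ("001", "withdrawn", "2015-03-20"),
    ],
)


def status_as_of(day: str) -> dict:
    """Return each subject's latest status recorded on or before `day` (ISO date)."""
    rows = conn.execute(
        """
        SELECT h.subject_id, h.status
        FROM subject_history AS h
        WHERE h.recorded_at = (
            SELECT MAX(recorded_at) FROM subject_history
            WHERE subject_id = h.subject_id AND recorded_at <= ?
        )
        """,
        (day,),
    ).fetchall()
    return dict(rows)


february = status_as_of("2015-02-15")
april = status_as_of("2015-04-01")
```

ISO-formatted date strings sort in chronological order, which is why plain text comparison is enough for the "on or before" filter here.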
Team
Thank You !!
Questions …

Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfROCENODodongVILLACER
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvLewisJB
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfAsst.prof M.Gokilavani
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)Dr SOUNDIRARAJ N
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxbritheesh05
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 

Recently uploaded (20)

computer application and construction management
computer application and construction managementcomputer application and construction management
computer application and construction management
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdf
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
Artificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptxArtificial-Intelligence-in-Electronics (K).pptx
Artificial-Intelligence-in-Electronics (K).pptx
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 

Redefining ETL Pipelines with Apache Technologies to Accelerate Decision-Making and Execution for Clinical Trials

  • 1. Redefining ETL Pipelines with Apache Technologies to Accelerate Decision Making for Clinical Trials. Eran Withana
  • 2. Overview
      • Clinical Trials – Lay of the land
      • Business and Technical Requirements
      • Technology Evaluation
      • High Level Architecture
      • Implementation: Managing Hardware; Deployments; Data Adapters: Implementation and Failure Modes; Distributed File System
      • Challenges
      • Future Work
  • 3. About me
      • Open Source: Member, PMC member and committer of ASF (Apache Axis2, Web Services, Synapse, Airavata)
      • Education: PhD in Computer Science from Indiana University
      • Software engineer at Comprehend Systems
  • 5. Clinical Trials – Lay of the land: Number of Drugs in Development Worldwide (Source: CenterWatch Drugs in Clinical Trial Database 2014). Source: http://www.phrma.org/innovation/clinical-trials
  • 6. Clinical Trials – Lay of the Land: Multiple Stakeholders
      • Stakeholders: Study Managers, Program Managers, Monitors, Data Managers, Bio-statisticians, Executives, Medical Affairs, Regulatory, Vendors, CROs, CRAs
      • Diagram labels: Sites, Labs, Patients, Safety, EDC, Reports, Data, PV Data, Excel
      • Diagram callouts: Latent, Fragmented
      • Layers: Sponsor, Contract Research Organization (CRO), Sites and Investigators
  • 7. For decades, clinical development was primarily paper-based.
  • 8. Various Software and Practices Used in Each Layer. Diagram labels: medidata, CROs and SIs, Technologies
  • 9. Clinical Trials with Centralized Monitoring
      • Diagram labels: Clinical Operations, Sites, Labs, Patients, Clinical Analytics & Collaboration, Data, Safety, EDC, PV Data, Excel
      • Diagram callouts: Consolidated, Real-time, Self-Service, Mobile
  • 10. Providing up-to-date answers. Diagram labels: Executives, Medical Review, CRAs, Data Management, Clinical Operations, EDC, CTMS, Safety, ePro, Other, Web, Ad-Hoc, Mobile, Collab
  • 12. Business Requirements
      • FDA, HIPAA compliance
      • Metadata/database structure synchronization: less frequent (once a day)
      • Data synchronization: more frequent (multiple times a day)
      • Ability to plug in various data sources: RAVE, MERGE, BioClinica, file imports, DB-to-DB syncs
      • Real-time event propagation: adverse events (AEs), the need for early identification
  • 13. Technical Requirements
      • Hardware agnostic for resiliency and better utilization
      • Repeatable deployments
      • Real-time processing and real-time events
      • Fault tolerance
      • In-flight and end-state metrics for alerting and monitoring
      • Flexible and pluggable adapter architecture
      • Time travel
      • Audit trails
      • Report generation
  • 14. Core Design Principles
      • Events all the way
      • Shared event bus for multiple consumers
      • Use of language-agnostic data representations (via protobuf)
      • Automatic datacenter resource management (Mesos/Marathon/Docker)
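The "shared event bus for multiple consumers" principle can be illustrated with a minimal in-memory sketch. It is not a Kafka client; `EventLog`, the consumer names, and the event fields are all hypothetical. It only shows the idea that every consumer reads the same append-only log at its own pace.

```python
from collections import defaultdict


class EventLog:
    """In-memory stand-in for a shared event bus (hypothetical, not a Kafka client)."""

    def __init__(self):
        self._events = []                 # append-only list of events
        self._offsets = defaultdict(int)  # next unread index per consumer

    def publish(self, event):
        self._events.append(event)

    def poll(self, consumer):
        """Return the events this consumer has not seen yet and advance its offset."""
        start = self._offsets[consumer]
        batch = self._events[start:]
        self._offsets[consumer] = len(self._events)
        return batch


log = EventLog()
log.publish({"type": "adverse_event", "study": "S1"})
log.publish({"type": "row_updated", "study": "S1"})

alerts_seen = log.poll("alerting")    # the first two events
log.publish({"type": "row_updated", "study": "S2"})
reports_seen = log.poll("reporting")  # all three events, independent of "alerting"
alerts_later = log.poll("alerting")   # only the third event
```

Because each consumer keeps its own offset, adding a new consumer (for example, a reporting job) does not change what existing consumers receive.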
  • 16. Technologies Evaluated
      • Data processing: Apache Storm and Trident, Apache Spark and Spark Streaming, Samza, Summingbird, Scalding, Apache Falcon, Azkaban
      • Coordination and configuration management: Apache Zookeeper, Redis, Apache Curator
      • Event queue: Apache Kafka
      • Scheduling: Chronos, Apache Mesos, Marathon, Apache Aurora
      • Database synchronization: Liquibase, Flyway DB
      • Data representations: Apache Thrift, protobuf, Avro
      • Deployments: Ansible
      • File management: Apache HDFS
      • Monitoring and alerting: Graphite, StatsD
      • Database: PostgreSQL, Apache Spark
      • Resource isolation: LXC, Docker
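  • 17. Data Processing Technology Evaluation
      • Columns: Storm + Trident | Spark + Streaming | Samza | Summingbird | Scalding | Falcon | Chronos | Aurora | Azkaban
      • DAG Support: Y | DAGScheduler | Y | Y | Y | Y | Y | N | Y
      • DAG Nodes Resiliency: Y | Y | Y | Y | Y | Y | Y | N | Y
      • Event Driven: Y | Y | Y | Y | N | N | N | N | N
      • Timed Execution: Y Y Y Y Y Y Y Y (only eight values survive in the transcript, so the column alignment for this row is unknown)
      • DAG Extension: Y | Y | Y | Y | Y | Y | Y | Y | Y
      • Inflight and end state metrics: Y | Y | Y | Y | Y | Y | Y | Y | Y
      • Hardware Agnostic: Y | Y | Y | Y | Y | Y | Y | Y | Y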
  • 21. Managing Hardware
      • Bare metal boxes
      • Partitioned using LXC containers
      • Use of Mesos to do the resource allocations as needed for jobs
  • 26. Data Adapters
      • Syncher is for DB structural changes
          • Syncher creates a database schema from the source information
          • Runs a generic database diff and applies the differences to the target database
      • Seeder is for data synchronization
          • Uses the database schema created by Syncher
      • Seeders get jobs from
          • Syncher, or
          • a timed scheduler
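The "generic database diff" step can be pictured with a small sketch. This is a hand-rolled illustration rather than the Syncher's actual code or the Liquibase/Flyway tooling the deck lists as evaluated. The function name `diff_schema`, the dictionary layout, and the PostgreSQL-style statements it emits are all assumptions.

```python
def diff_schema(source, target):
    """Return DDL-like steps that would make `target` match `source`.

    Both arguments map table name -> {column name -> column type}.
    """
    steps = []
    for table, columns in sorted(source.items()):
        if table not in target:
            cols = ", ".join(f"{name} {ctype}" for name, ctype in sorted(columns.items()))
            steps.append(f"CREATE TABLE {table} ({cols})")
            continue
        for name, ctype in sorted(columns.items()):
            if name not in target[table]:
                steps.append(f"ALTER TABLE {table} ADD COLUMN {name} {ctype}")
            elif target[table][name] != ctype:
                steps.append(f"ALTER TABLE {table} ALTER COLUMN {name} TYPE {ctype}")
        # Columns that exist only in the target are dropped.
        for name in sorted(set(target[table]) - set(columns)):
            steps.append(f"ALTER TABLE {table} DROP COLUMN {name}")
    # Tables that exist only in the target are dropped.
    for table in sorted(set(target) - set(source)):
        steps.append(f"DROP TABLE {table}")
    return steps


source = {"subject": {"id": "integer", "site": "text", "enrolled_on": "date"},
          "visit": {"id": "integer", "subject_id": "integer"}}
target = {"subject": {"id": "integer", "site": "varchar(10)", "legacy": "text"}}

plan = diff_schema(source, target)
for step in plan:
    print(step)
```

Keeping the diff as a list of steps, separate from applying them, makes it possible to run the whole plan inside one transaction and roll back if any step fails.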
  • 27. Data Adapters – Coordination and Configuration
      • Coordination and configuration through Zookeeper
          • Job configuration
          • Connection information
          • Distributed locking and counters
          • Metric maintenance
          • Last successful run
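A rough sketch of how a lock and a "last successful run" marker can work together is shown below. `FakeCoordinator` is an in-memory stand-in invented for this example. It is not the ZooKeeper API, and the paths and function names are hypothetical. A real deployment would normally rely on ephemeral nodes or a library recipe so that a crashed worker's lock disappears with its session.

```python
class NodeExists(Exception):
    """Raised when a node is created at a path that is already taken."""


class FakeCoordinator:
    """In-memory stand-in for a coordination service (hypothetical, not the ZooKeeper API)."""

    def __init__(self):
        self._nodes = {}

    def create(self, path, value=b""):
        if path in self._nodes:
            raise NodeExists(path)
        self._nodes[path] = value

    def set(self, path, value):
        self._nodes[path] = value

    def get(self, path, default=None):
        return self._nodes.get(path, default)

    def delete(self, path):
        self._nodes.pop(path, None)


def run_seeder(coord, study, now, seed):
    """Run `seed` for `study` unless another worker already holds the lock."""
    lock = f"/locks/{study}"
    try:
        coord.create(lock)      # acquire: only one creator succeeds
    except NodeExists:
        return False            # someone else is seeding this study
    try:
        seed(coord.get(f"/last_success/{study}"))
        coord.set(f"/last_success/{study}", now)
        return True
    finally:
        coord.delete(lock)      # release even if seed() raised


coord = FakeCoordinator()
seen = []
first = run_seeder(coord, "S1", 100, seen.append)
second = run_seeder(coord, "S1", 200, seen.append)
coord.create("/locks/S1")       # simulate another worker holding the lock
blocked = run_seeder(coord, "S1", 300, seen.append)
```

The marker is written only after `seed` returns, so a failed run leaves the previous timestamp in place and the next run picks up from there.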
  • 29. Failure Modes
      • Syncher
          • Connectivity to source/sink systems fails: retry the job N times and alert, if needed
          • Schema changes to the database fail midway: transaction rollback
      • Seeder
          • Connectivity to source/sink systems fails: retry the job N times and alert, if needed
          • Seeding fails midway: Storm retries tuples; failing tuples are moved to an error queue
          • Table and row level failures: option to skip the tables/rows but send a report at the end
          • Effect on "live" tables during data synchronization: option to use transactions, or use temporary tables and swap with the original upon completion
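Two of these failure-handling patterns, "retry N times and alert" and "move failing items to an error queue", can be sketched in plain Python. This is only an illustration of the control flow; in the deck the tuple retries are handled by Storm. The function names, exception types, and row layout below are assumptions.

```python
def run_with_retries(job, attempts, alert):
    """Call `job` up to `attempts` times; alert and re-raise if every attempt fails."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return job()
        except ConnectionError as exc:
            last_error = exc
    alert(f"job failed after {attempts} attempts: {last_error}")
    raise last_error


def seed_rows(rows, write, max_tries=3):
    """Write each row, retrying failures; rows that keep failing go to an error queue."""
    error_queue = []
    for row in rows:
        for attempt in range(max_tries):
            try:
                write(row)
                break
            except ValueError as exc:
                if attempt == max_tries - 1:
                    error_queue.append((row, str(exc)))
    return error_queue


calls = {"n": 0}


def flaky_connect():
    # Fails twice, then succeeds, to exercise the retry loop.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source unreachable")
    return "connected"


alerts = []
result = run_with_retries(flaky_connect, attempts=5, alert=alerts.append)

written = []


def write(row):
    if row["age"] is None:
        raise ValueError("age is required")
    written.append(row["id"])


failed = seed_rows(
    [{"id": 1, "age": 40}, {"id": 2, "age": None}, {"id": 3, "age": 35}], write
)
```

The error queue keeps the failing row together with the reason, which is what an end-of-run report on skipped rows would need.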
  • 30. What Have We Gained
      • Can bring in data from more data sources and more studies effectively
      • Run real-time reports on studies and configure alerts (future)
      • Can configure refreshes as needed by each use case
      • Can throttle input and output sources at study/customer level
      • Ability to onboard new customers and deploy new studies with minimal human intervention
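The deck does not say how throttling is implemented. One common way to throttle per study or per customer is a token bucket per key, sketched below as an assumption. The class name, rates, and study identifiers are hypothetical, and the clock is passed in explicitly so the behaviour is deterministic.

```python
class TokenBucket:
    """Allow roughly `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One bucket per study, so a busy study cannot starve the others.
buckets = {
    "study-A": TokenBucket(rate=1, capacity=2),
    "study-B": TokenBucket(rate=1, capacity=2),
}

a_results = [buckets["study-A"].allow(now=0.0) for _ in range(3)]
b_result = buckets["study-B"].allow(now=0.0)
a_after_wait = buckets["study-A"].allow(now=1.0)
```

Here study A exhausts its burst of two and is refused a third request, while study B is unaffected. After one second study A has earned one token back.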
  • 31. What Have We Gained: a generic framework which
      • eases integration with new data sources (for each new source, implement a method to create a virtual schema and a method to get data for a given table)
      • can scale and is fault tolerant
      • has generic monitoring and alerting
      • eases maintenance, since it is mostly generic code
      • notifies of important events through messages
      • runs on any hardware
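The two-method contract for new sources (create a virtual schema, get data for a table) might look like the sketch below. The class and method names are hypothetical, and the toy CSV adapter uses naive comma splitting purely for illustration.

```python
from abc import ABC, abstractmethod


class SourceAdapter(ABC):
    """Hypothetical adapter contract: one schema method, one data method."""

    @abstractmethod
    def virtual_schema(self):
        """Return {table name: [column names]} describing the source."""

    @abstractmethod
    def rows(self, table):
        """Yield rows of `table` as dicts keyed by column name."""


class CsvTextAdapter(SourceAdapter):
    """Toy adapter over in-memory CSV text, standing in for a file import."""

    def __init__(self, files):
        self._files = files  # {table name: CSV text with a header row}

    def virtual_schema(self):
        return {table: text.splitlines()[0].split(",")
                for table, text in self._files.items()}

    def rows(self, table):
        header, *lines = self._files[table].splitlines()
        columns = header.split(",")
        for line in lines:
            yield dict(zip(columns, line.split(",")))


adapter = CsvTextAdapter({"subject": "id,site\n1,Boston\n2,Austin"})
schema = adapter.virtual_schema()
subjects = list(adapter.rows("subject"))
```

Everything downstream of the adapter (diffing, seeding, monitoring) can then stay generic, because it only depends on these two methods.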
  • 33. Distributed File System – Requirements
      • Accessibility
          • Customers must be able to drop files securely (SFTP-like functionality)
          • Ability to access resources through URLs
      • Data storage
      • Scalability and redundancy
          • Scale out by adding nodes
          • Resilience against loss of nodes, data centers and replication
      • Miscellaneous
          • Access control over read/write
          • Performance/usage/resource utilization monitoring
  • 34. HDFS with High Availability Mode
      • Two name nodes running in HA mode, co-located with two journal nodes
      • Third journal node on a separate node
      • Data nodes on all bare metal nodes
      • Mounting HDFS with FUSE and enabling SFTP through OS-level features
      • Automatic failover through DNS and HAProxy
  • 36. Challenges
      • Regulatory requirements
          • Data encryption requirements for clinical data
          • Audit trails
      • Data quality
      • Source system constraints
      • Coordination between Synchers and Seeders
          • Distributed locks and counters
      • Automatic failover when a name node fails in HDFS
          • HDFS HA mode stores the active name node in ZK as a Java serialized object, yikes!!
  • 38. Future Work
      • Time travel
          • Ability to go back in time and run reports at any given point in time
          • Trail of data
      • Containerization
      • In-memory query execution with Apache Spark
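Time travel is listed as future work, so the deck describes no implementation. As one possible reading, an append-only history per key with an "as of" lookup gives both point-in-time reports and a trail of data. Everything in the sketch below (class name, key format, integer timestamps) is an assumption.

```python
from bisect import bisect_right


class VersionedTable:
    """Append-only history per key; reads can ask for the value as of a timestamp."""

    def __init__(self):
        self._history = {}  # key -> (timestamps, values), timestamps ascending

    def put(self, key, timestamp, value):
        stamps, values = self._history.setdefault(key, ([], []))
        if stamps and timestamp < stamps[-1]:
            raise ValueError("timestamps must be non-decreasing per key")
        stamps.append(timestamp)
        values.append(value)

    def as_of(self, key, timestamp):
        """Return the latest value written at or before `timestamp`, else None."""
        stamps, values = self._history.get(key, ([], []))
        index = bisect_right(stamps, timestamp)
        return values[index - 1] if index else None

    def trail(self, key):
        """Return the full (timestamp, value) history for `key`."""
        stamps, values = self._history.get(key, ([], []))
        return list(zip(stamps, values))


table = VersionedTable()
table.put("subject-1/status", 10, "screened")
table.put("subject-1/status", 20, "enrolled")
table.put("subject-1/status", 35, "withdrawn")

before = table.as_of("subject-1/status", 5)
at_twenty = table.as_of("subject-1/status", 20)
latest = table.as_of("subject-1/status", 100)
history = table.trail("subject-1/status")
```

Nothing is ever overwritten, so the same structure that answers "what did this look like at time T" also serves as the audit trail.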

Editor's Notes

  1. Dose 20-100 Efficacy and safety 100-300 > 1000