2. A bit about me
I am a computer scientist who builds data management and analytics systems
My talk is from the perspective of a big data analytics system builder
I have some exposure to healthcare data and analytics problems through
collaborating with experts in the IBM Watson Health division
3. What is big data?
Gartner’s 3Vs definition:
“Big Data is high-volume, high-velocity and/or high-variety
information assets that demand cost-effective, innovative forms of
information processing that enable enhanced insight, decision
making, and process automation.”
Extra Vs
Variability, Veracity, Visualization, Value
How big is big data?
It is all relative
It is always a moving definition
It is not all about the size
My answer: when conventional data management and analytics tools
are inadequate = big data
Figure from https://www.linguamatics.com/blog/big-data-real-world-data-where-does-text-analytics-fit
Big Data 3Vs
4. Why is big data important for health care?
Large volumes of data
eHealth
mHealth
Sensor & wearable technologies
Genome sequencing
New applications
Personalized medicine
Clinical risk intervention
Predictive analytics
Big Data
6. Two dimensions of big data analytics
Data type
Structured data
Records in relational database
tables
Semi-structured data
JSON and XML
Unstructured data
Text data
Graph data
Social and interaction data
Multi-media data
Images and videos
Complexity of analytics
Data entry and retrieval
Look up a patient's EHR at check-in
Descriptive summaries
Compute the number of outbreaks
across different geo regions
Pattern discovery (data mining)
Identify unusual patterns of medical
claims by clinics, physicians, labs, etc.
Predictive analytics (machine learning)
Predict a patient’s readmission to the
hospital
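As a toy illustration of a descriptive summary (the regions and reports below are made up), counting outbreaks per region is a simple group-and-count:

```python
from collections import Counter

# Hypothetical outbreak reports, one (region, disease) pair per report.
reports = [
    ("Northeast", "flu"), ("South", "flu"),
    ("Northeast", "measles"), ("Midwest", "flu"),
    ("Northeast", "flu"),
]

# Descriptive summary: number of reported outbreaks per geographic region.
outbreaks_per_region = Counter(region for region, _ in reports)
```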
7. Big data analytics landscape
Data type (columns): Structured, Semi-structured, Graph, Text, Multi-media
Analytics complexity (rows):
Data entry and retrieval: OLTP (online transactional processing) for structured data; key-value/document stores for semi-structured; graph databases for graphs; keyword search for text; ...
Descriptive summaries: SQL-on-Hadoop* (OLAP, online analytical processing) for structured and semi-structured; degree distribution and clustering coefficient distribution for graphs; word cloud for text; ...
Pattern discovery (data mining): DM on big data (frequent pattern mining, anomaly detection, clustering); graph processing (graph clustering, influence analysis) for graphs; topic modeling and sentiment analysis for text; ...
Predictive analytics (machine learning): ML on big data (regression, classification, recommendation, link prediction) across all data types
8. Big data analytics landscape
(Landscape table repeated; see slide 7.)
9. Big data analytics landscape
(Landscape table repeated; see slide 7.)
10. Background on traditional SQL processing
OLTP (online transactional processing) vs OLAP (online analytical processing)
Specialized OLTP and OLAP systems connected by the ETL (extract, transform,
load) process
OLTP: purpose is data entry and retrieval; queries are simple reads, inserts, updates and deletes; speed is real-time (low latency and high throughput)
OLAP: purpose is BI (business intelligence) or reporting; queries are more complex analytical and ad hoc queries (mostly optimized for read); speed is interactive
Diagram: transactions go to the OLTP system; an ETL/replication process moves data into the OLAP system (the EDW, enterprise data warehouse), which serves analytic queries.
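The OLTP/OLAP contrast above can be sketched with Python's built-in sqlite3 (a single-file stand-in for illustration; real deployments use separate, specialized systems):

```python
import sqlite3

# In-memory database standing in for both the OLTP and OLAP sides.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (pid INTEGER, visit_date TEXT, reason TEXT)")

# OLTP-style workload: simple, low-latency inserts and point lookups.
rows = [(1, "2016-03-15", "Fever"), (2, "2016-10-20", "Headache"),
        (1, "2017-02-08", "Fever"), (3, "2017-06-18", "Cold")]
conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)
point_lookup = conn.execute(
    "SELECT reason FROM visits WHERE pid = 2").fetchone()

# OLAP-style workload: an aggregate query scanning the whole table.
visits_per_reason = dict(conn.execute(
    "SELECT reason, COUNT(*) FROM visits GROUP BY reason").fetchall())
```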
11. Why SQL-on-Hadoop?
SQL (Structured Query Language) is the de facto language for transactional
and decision support systems and BI tools
Healthcare analysts and hospital IT experts are very familiar with SQL
SQL-on-Hadoop eases the transition to big data
Little or no change to existing BI tools and applications
SQL-on-Hadoop overcomes some shortcomings of conventional EDWs
Scalability & fault tolerance
Better support for semi-structured data
Directly work on raw data (query in situ) by avoiding ETL
12. SQL-on-Hadoop Landscape
Figure: SQL-on-Hadoop systems grouped by approach. Systems that query open data in open formats, either through a SQL layer on an existing platform (e.g. Spark SQL) or through a purpose-built MPP query engine (e.g. Impala, Big SQL); systems that control the storage layer with proprietary data formats (e.g. Vortex, dashDB); and remote-query extensions of existing EDWs (e.g. PolyBase, SQL-H).
13. Technical Challenge
How to distribute data and computation in a large cluster of machines for performance
Bottleneck: transferring large volumes of data across the network
Example: join (combining columns from multiple tables)
Clinical Visits
PID  VisitDate   Reason
1    2016-03-15  Fever
2    2016-10-20  Headache
1    2017-02-08  Fever
3    2017-06-18  Cold

Patient Info
PID  Name        DOB         Sex
1    Jim Green   1980-04-15  M
2    Alice Lee   1965-11-11  F
3    Rose Darcy  2001-07-21  F

Joined on PID
PID  VisitDate   Reason    Name        DOB         Sex
1    2016-03-15  Fever     Jim Green   1980-04-15  M
2    2016-10-20  Headache  Alice Lee   1965-11-11  F
1    2017-02-08  Fever     Jim Green   1980-04-15  M
3    2017-06-18  Cold      Rose Darcy  2001-07-21  F
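A minimal in-memory sketch of this join (a hash join in plain Python, using the example rows from the slide):

```python
# The two example tables from the slide, as lists of dicts.
visits = [
    {"pid": 1, "visit_date": "2016-03-15", "reason": "Fever"},
    {"pid": 2, "visit_date": "2016-10-20", "reason": "Headache"},
    {"pid": 1, "visit_date": "2017-02-08", "reason": "Fever"},
    {"pid": 3, "visit_date": "2017-06-18", "reason": "Cold"},
]
patients = [
    {"pid": 1, "name": "Jim Green", "dob": "1980-04-15", "sex": "M"},
    {"pid": 2, "name": "Alice Lee", "dob": "1965-11-11", "sex": "F"},
    {"pid": 3, "name": "Rose Darcy", "dob": "2001-07-21", "sex": "F"},
]

# Hash join on pid: build a hash table on the smaller table,
# then probe it with each row of the larger one.
by_pid = {p["pid"]: p for p in patients}
joined = [{**v, **by_pid[v["pid"]]} for v in visits if v["pid"] in by_pid]
```

In a distributed setting the probe side is spread across machines, which is exactly why the build table (or partitions of both tables) must move over the network.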
14. SQL-on-Hadoop Strategies (1/2)
Storing data in formats that are easy for query processing
Columnar data formats (Parquet, ORCFile)
Pushing analytics close to the data
Intelligent data readers (apply predicates and projections while reading the data)
Carefully choosing the algorithm and what data to transfer for each analytics operation
E.g. how to choose from different join algorithms based on data characteristics
Figure: broadcast join vs repartition join on a 3-node cluster, joining a blue table (B) with a smaller green table (G). Broadcasting the smaller table costs 2|G| over the network; repartitioning both tables costs 2/3|B| + 2/3|G|.
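Generalizing the costs in the figure to an n-node cluster (a simplified sketch; the (n-1)/n repartition fraction and the cost model itself are assumptions here, and real optimizers also weigh memory and data skew):

```python
def broadcast_cost(n_nodes, small_size):
    # Ship every partition of the smaller table to the other n-1 nodes.
    return (n_nodes - 1) * small_size

def repartition_cost(n_nodes, size_b, size_g):
    # Rehash both tables; on average (n-1)/n of each table moves.
    return (n_nodes - 1) / n_nodes * (size_b + size_g)

def choose_join(n_nodes, size_b, size_g):
    # Pick the algorithm with the smaller estimated network cost.
    small = min(size_b, size_g)
    if broadcast_cost(n_nodes, small) < repartition_cost(n_nodes, size_b, size_g):
        return "broadcast"
    return "repartition"
```

With n = 3 this reproduces the figure: broadcasting a small table G costs 2|G|, so it wins when G is much smaller than B, while repartitioning wins when both tables are large.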
15. SQL-on-Hadoop Strategies (2/2)
Pre-process data into better organization for queries
Hash or range-based data partitioning and bucketing
Auxiliary data structures for eliminating unnecessary data access
Indexing and synopsis
Better data placement for related data
E.g. collocating related data together on HDFS (Hadoop distributed file system)
Figure: with co-partitioning, only the matching green partitions are shipped, for a network cost of |G|; with co-partitioning plus co-location on the same nodes, matching partitions already sit together and the network cost is 0.
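A toy sketch of why co-partitioning works: if both tables are partitioned by the same hash function on the join key, matching rows are guaranteed to land in the same partition, so no shuffle is needed at join time:

```python
def partition(key, n_nodes):
    # Hash partitioning: equal join keys map to the same node.
    return hash(key) % n_nodes

n = 3
visit_pids = [1, 2, 1, 3]     # join keys from the visits table
patient_pids = [1, 2, 3]      # join keys from the patients table

# Both tables use the same partitioning function on the join key.
visit_nodes = {pid: partition(pid, n) for pid in visit_pids}
patient_nodes = {pid: partition(pid, n) for pid in patient_pids}

# Every patient row sits on the same node as its matching visit rows.
co_partitioned = all(visit_nodes[p] == patient_nodes[p] for p in patient_nodes)
```

Co-location goes one step further: HDFS normally places block replicas independently, so co-partitioned blocks may still live on different machines; placing related partitions on the same machine removes the remaining transfer.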
16. Big data analytics landscape
(Landscape table repeated; see slide 7.)
17. Machine learning on big data
SQL analytics tools are not enough to capture the full value of big data
Big data impact on ML (machine learning):
Opportunities:
More training data leads to better predictions
We can train a model with billions of parameters, because we have sufficiently big data
Making deep learning possible!
Challenges:
Scalability and distributed computing
A big learning curve for data scientists
19. Different levels of abstraction for big ML systems
ML libraries
E.g. Spark MLlib, H2O, IBM Watson
Provide a list of parameterized ML algorithms
Declarative ML
E.g. SystemML, Mahout
Expose an R- or MATLAB-like language to users
Primitives: linear algebra and math operations
Cost-based optimizer to compile execution plans
Also provide a library of ML algorithms
AutoML
E.g. H2O
Automate the process of training a large selection of candidate models
Figure: the SystemML stack, with a language layer, a cost-based compiler, and a runtime targeting either a single in-memory node (scale-up) or a Hadoop/Spark cluster (scale-out).
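A declarative ML script is essentially linear algebra like the loop below. As a stand-in (pure Python with made-up data points; a system such as SystemML would express this in its own R-like language and compile it to a distributed plan), here is linear regression by batch gradient descent:

```python
# Fit y ~ w * x by batch gradient descent (hypothetical data points).
X = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]   # roughly y = 2x

w = 0.0
lr = 0.01
for _ in range(1000):
    # Gradient of the mean squared error (1/n) * sum((w*x - y)^2) w.r.t. w.
    grad = sum(2 * (w * xi - yi) * xi for xi, yi in zip(X, y)) / len(X)
    w -= lr * grad
```

The point of the declarative approach is that the data scientist writes only this math; the system decides how to partition X and y and where to run the sums.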
20. Big data analytics landscape
(Landscape table repeated; see slide 7.)
21. Graph analytics on big data
Graphs provide a powerful primitive for modeling real-world objects and the
relationships between objects
Patient-patient/doctor-patient interactions, biological pathways, protein
interaction networks, ontologies, knowledge graphs, etc
Two types:
Graph databases: focus on real-time graph analytics
Graph processing systems: focus on batch processing of graphs
22. Graph databases
Real-time graph analytics
Updates, simple node and edge retrieval
Pattern matching queries
Given a graph pattern, find subgraphs in the database graphs that (exactly or
approximately) match the query
Example: find out what biological processes are affected by a disease
Querying a disease pathway against a database of known pathways
Figure: graph database examples, including SAGA querying a disease pathway against a database of known pathways.
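A toy sketch of pattern matching over a graph stored as edge triples (the disease and gene names are made up; real graph databases answer such queries with dedicated query languages and indexes, and systems like SAGA additionally support approximate matches):

```python
# Graph as (subject, predicate, object) edge triples (hypothetical data).
edges = [
    ("DiseaseX", "disrupts", "GeneA"),
    ("GeneA", "regulates", "Apoptosis"),
    ("GeneA", "regulates", "CellGrowth"),
    ("GeneB", "regulates", "Metabolism"),
]

def match(pattern, edges):
    # Exact triple matching: None acts as a wildcard position.
    return [e for e in edges
            if all(p is None or p == v for p, v in zip(pattern, e))]

# Which biological processes does DiseaseX affect (a two-hop query)?
disrupted = {o for _, _, o in match(("DiseaseX", "disrupts", None), edges)}
affected = {o for s, _, o in match((None, "regulates", None), edges)
            if s in disrupted}
```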
23. Graph processing systems
Batch graph analytics
Long running (usually iterative) analysis on the entire graph
E.g. PageRank algorithm to identify key influencers of a disease propagation
network
Performance bottleneck: network overhead
Better graph partitioning and absorbing messages within a partition
Combining messages (when messages can be aggregated)
Figure: graph processing systems, including Microsoft Graph Engine.
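A toy PageRank loop illustrating the message-combining idea: messages headed to the same target node are summed before delivery, which is the aggregation optimization described above (the three-node graph is made up):

```python
# Toy directed graph: each node sends rank/out-degree along its out-edges.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {v: 1 / len(graph) for v in graph}
damping = 0.85

for _ in range(50):
    # Combiner: sum all messages destined for the same node before delivery,
    # so only one value per (source partition, target) crosses the network.
    combined = {v: 0.0 for v in graph}
    for src, outs in graph.items():
        for dst in outs:
            combined[dst] += ranks[src] / len(outs)
    ranks = {v: (1 - damping) / len(graph) + damping * combined[v]
             for v in graph}
```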
24. Big data analytics landscape
(Landscape table repeated; see slide 7.)
25. Integrated analytics
An application often requires different types of analytics together
E.g. SQL is often used to prepare the data for ML
An example: Medtronic & IBM Watson Health Partnership
"gathers a patient’s readings from Medtronic insulin pumps and glucose monitors,
and combines them with information taken from the individual’s activity trackers
and diet. The system uses pattern recognition gleaned through IBM’s Watson to
provide feedback on how a patient can manage their diabetes”
“Medtronic's insulin pumps using Watson artificial intelligence (AI) could warn
patients of abnormally low blood sugar levels up to three hours in advance”
References:
https://www.meddeviceonline.com/doc/ibm-watson-to-power-medtronic-s-diabetes-app-under-armour-s-fitness-app-0001
26. Solutions for Integrated analytics
Integrating existing analytics systems
Data transformation: transform the data format between different systems
Data transfer: transfer the output of one system to another system
Building a single system for various types of analytics
E.g. Spark, Wildfire (IBM Project EventStore)
Diagram: Spark (OLTP, OLAP, ML, stream, batch GA) and Wildfire (real-time GA) running over shared storage.
27. Conclusion
Big data analytics comes in different forms
What types of data do you have?
What level of complexity does the analytics require?
What is the latency requirement?
An application often requires different types of analytics together
What types of analytics do you need to integrate?
What is your performance requirement?
Do you need to integrate existing analytics pipelines, or can you start with a
single system that supports all analytics?
I will try to provide a roadmap in this talk to help you navigate through the big data analytics landscape.
I'm not a healthcare domain expert; however, I have some exposure to healthcare data and analytics problems and have been collaborating with experts in the IBM Watson Health division to formulate this talk.
The first question, before we talk about big data analytics, is: what is big data? The most popular definition is the 3V definition from Gartner.
And over the years, others have extended the definition of big data with more Vs.
The next question people usually ask is how do you know you have big data? How big is big data?
Well, it is all relative, and with advances in storage and data processing technology, it is always a moving definition. Ten years ago, people thought 1 petabyte of data was huge; nowadays it is becoming very common, and people have started to talk about exabytes and even zettabytes. And as we have seen with the 3V definition, it is not all about size. So, how big is big data? There is no agreed-upon answer. My answer is that you know you are dealing with big data when conventional data management and analytics tools are not enough.
Volume - The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.
Variety - The type and nature of the data. This helps people who analyze it to effectively use the resulting insight.
Velocity - In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
Variability - Inconsistency of the data set can hamper processes to handle and manage it.
Veracity - The quality of captured data can vary greatly, affecting the accuracy of analysis.
Large volumes of data are being accumulated in the healthcare domain, due to eHealth, mobile health, the wide use of sensor and wearable technologies, and the advancement of genome sequencing. In addition, a number of new healthcare applications have emerged because of big data, such as personalized medicine, clinical risk intervention, and predictive analytics.
Big data analytics can mean different things to different people. For some people it is machine learning; for others it may be SQL analytics. This is not surprising, because big data analytics comes in different forms. In this talk, I will categorize big data analytics.
I will categorize big data analytics along two dimensions.
The other dimension is complexity of analytics, ranging from simple to more complex. The simplest type of analytics is data entry and retrieval, for example, retrieving a patient's EHR when she checks in at a hospital. The next type is creating descriptive summaries, which group data and compute statistics. The next level goes beyond computing simple statistics to discovering patterns using data mining techniques; for example, for fraud detection, identifying unusual patterns of medical claims by clinics, physicians, labs, and so on. The last level is predictive analytics using machine learning techniques, for example, predicting whether a patient will be readmitted to the hospital based on historical data.
Now here is the big data analytics landscape along the two dimensions. The horizontal dimension is data type, and the vertical dimension is analytic complexity. I am not familiar with image or video processing, so I am going to leave multi-media out.
For structured data, data entry and retrieval is basically OLTP; for semi-structured data, people use key-value/document stores, such as Cassandra or MongoDB. For graph data entry and retrieval, people use graph databases, like Neo4j and JanusGraph. For text data, people use search systems for keyword search.
For both structured and semi-structured data, people use SQL-on-Hadoop systems for descriptive summaries; they basically do OLAP. Notice the star next to Hadoop? Here "Hadoop" is abused to represent big data; many SQL-on-Hadoop systems do not really use Hadoop underneath. For graphs, people compute statistics such as the degree distribution and clustering coefficient distribution, and for text, the word cloud is the most widely used method for descriptive summaries. For structured and semi-structured data, people use data mining on big data for frequent pattern mining, anomaly detection, and clustering; for graphs, people use graph processing systems for graph clustering, influence analysis, etc. Examples of pattern discovery on text are topic modeling and sentiment analysis. For predictive analytics, big ML systems are used for all data types, but depending on the actual data type, you may need to do some data transformation to be able to use ML.
Over the years, I have worked on a number of these types of big data analytics, and I will cover those types in this talk.
Before I talk about SQL-on-Hadoop, I will briefly provide some background on traditional SQL processing. In traditional SQL processing, there are two types of workloads, OLTP and OLAP, and the differences between them are listed in this table. Because OLTP and OLAP systems have very different characteristics, the database field has evolved into having specialized OLTP systems and OLAP systems, and an ETL process is used to consolidate and transform transactional data from OLTP systems to OLAP systems.
Name any application in use at a hospital or in a physician’s office, and the chances are good that it runs on an OLTP database.
EHRs, lab systems, financial systems, patient satisfaction systems, patient identification, billing and payment processing, etc.
SQL (Structured Query Language) is the de facto language for transactional and decision support systems and BI tools to access and query a variety of data sources
Transitioning to big data requires a steep learning curve.
SQL-on-Hadoop systems support data warehousing functionality on big data, i.e., they focus on OLAP queries. There are many SQL-on-Hadoop systems today, and they can be categorized into several camps. The first camp supports querying existing data in open formats, so there is no lock-in. This camp can be further divided into sub-groups: the first group builds a simple SQL layer on existing data platforms (like Hive building on MapReduce, or Spark SQL building on Spark), whereas the second group builds an MPP query engine from scratch; the second group typically has better performance. The second camp controls the storage layer and uses proprietary formats. The last camp extends existing EDWs to work with big data.
Querying existing data with open format vs controlled storage layer with proprietary formats?
A SQL layer on top of existing big data systems (like MapReduce or Spark) or an MPP query engine architected from the ground up?
Directly querying big data vs going through an existing database?
The major technical challenge for SQL-on-Hadoop systems is how to distribute data and computation in a large cluster of machines for performance.
Quite often the major bottleneck is transferring large volumes of data across the network. Let's use the database join operator to illustrate this challenge. Join is a database operator that combines columns from multiple tables. For example, one can join the clinical visits and patient info tables on the patient id; the join brings together the records with the same pid. In the big data setting, the two tables are partitioned and distributed across the cluster, so the join needs to transfer data across the network to actually perform the query.
Here are some strategies applied in many SQL-on-Hadoop systems to address these challenges.
For example, in the past I have worked on comparing different join algorithms for big data and providing guidelines on how to choose among them for a particular query based on data characteristics. One join strategy for joining a big table with a small table is broadcasting the smaller table to all machines in the cluster and then performing local joins on each machine. In this figure, I have two tables, a blue table and a green table, both distributed across the machines in the cluster. In this particular case, the green table is the smaller one, so I ship all the partitions of the green table to every node; the red arrows represent the network communication. In total, this algorithm sends two times the size of the green table across the network. Another join strategy, which is good for joining two large tables, repartitions both tables and sends the corresponding partitions from both tables to one of the machines for processing; it sends in total 2/3|B| + 2/3|G| across the network. As you can see, depending on the sizes of the two tables, one algorithm may be preferred for a particular join operation.
Data partitioning partitions data based on values instead of randomly. When two tables are partitioned the same way on the join key, you only need to bring the corresponding partitions together for join processing, which often reduces the processing and network overhead.
Finally, better data placement can often bring a significant performance boost. For example, in one of my works, I extended HDFS to support collocation of related data in a best-effort fashion, and this technique can significantly reduce the network overhead. In this example, not only are the two tables co-partitioned, but the corresponding partitions are also collocated, so when joining the two tables together, no network cost is incurred.
We have talked so much about SQL-on-Hadoop; let's now move on to machine learning on big data.
Here is where machine learning comes in to help. Machine learning is not a new field, but big data has had a huge impact on it. First of all, it revived the whole machine learning field, because more training data usually leads to better predictions, and now we have enough data to train a model with billions of parameters. Big data essentially enabled deep learning. At the same time, big data also brings many challenges to machine learning, such as scalability and distributed computation. More importantly, it imposes a big learning curve on data scientists, because they need to worry not only about the particular ML algorithm but also about how to distribute the data and computation on the big data platform.
To help reduce the learning curve, many big ML systems have emerged. They are usually categorized into two camps: one for general machine learning, the other specialized in deep learning. The trend now is that the two camps are starting to converge, with general ML systems starting to support deep learning and the deep learning camp starting to support general ML algorithms. Personally, I haven't worked much on deep learning, so I will focus on the general ML camp.
Big ML systems help data scientists by masking the details of implementing ML algorithms for big data, and they provide different levels of abstraction. One group of big machine learning systems provides users with a library of machine learning algorithms. The behavior of each algorithm can be controlled by parameters, but that's it: the algorithms are pretty much black boxes, and there is no way to change their internals. This problem is addressed by declarative ML systems, like SystemML and Mahout. These systems usually expose an R- or MATLAB-like language to data scientists, with linear algebra and math operations as the primitives, and then employ a cost-based optimizer to compile the algorithm into efficient execution plans on the target platform. Finally, H2O has recently proposed a concept called AutoML. For a particular application, a data scientist usually tries a large number of candidate models and selects the best one; AutoML automates this process.
Next, I will briefly talk about graph databases and graph processing together.
Popular graph databases include Neo4j, JanusGraph, IBM Graph, etc. They focus on real-time graph analytics. Besides updates and simple node and edge retrieval, most graph databases support graph pattern matching queries: given a graph pattern, they find subgraphs in the database that match the query. Most graph databases only support exact matching, but sometimes approximate matching is necessary when graph data is noisy. In my PhD work, I built a system called SAGA for approximate graph matching; it can query a disease pathway against a database of known pathways to find out what biological processes are affected by a disease.
The second type of graph analytics system is the graph processing system. These focus on batch graph analytics: long-running, often iterative analyses over the entire graph.
Also, a new trend in these types of systems is to deal
The first solution is to integrate existing analytics systems together. The two challenges here are data transformation and data transfer.