© 2017 IBM Corporation
Big Data Analytics: From
SQL to Machine Learning
and Graph Analysis
Yuanyuan Tian
IBM Research -- Almaden
Keynote for KDD bigdas 2017
A bit about me
 I am a computer scientist who builds data management and analytics systems
 My talk is from the perspective of a big data analytics system builder
 I have some exposure to healthcare domain data and analytics problems by
collaborating with experts in IBM Watson Health division
2
What is big data?
 Gartner’s 3Vs definition:
 “Big Data is high-volume, high-velocity and/or high-variety
information assets that demand cost-effective, innovative forms of
information processing that enable enhanced insight, decision
making, and process automation.”
 Extra Vs
 Variability, Veracity, Visualization, Value
 How big is big data?
 It is all relative
 It is always a moving definition
 It is not all about the size
 My answer: when conventional data management and analytics tools
are inadequate = big data
3
Figure (Big Data 3Vs) from https://www.linguamatics.com/blog/big-data-real-world-data-where-does-text-analytics-fit
Why is big data important for health care?
 Large volumes of data
 eHealth
 mHealth
 Sensor & wearable technologies
 Genome sequencing
 New applications
 Personalized medicine
 Clinical risk intervention
 Predictive analytics
4
Big data analytics
 Big data analytics comes in different forms!
5
Two dimensions of big data analytics
 Data type
 Structured data
 Records in relational database
tables
 Semi-structured data
 JSON and XML
 Unstructured data
 Text data
 Graph data
 Social and interaction data
 Multi-media data
 Images and videos
 Complexity of analytics
 Data entry and retrieval
 Look up a patient’s EHR at check-in
 Descriptive summaries
 Compute the number of outbreaks
across different geo regions
 Pattern discovery (data mining)
 Identify unusual patterns of medical
claims by clinics, physicians, labs, etc
 Predictive analytics (machine learning)
 Predict a patient’s readmission to the
hospital
6
Big data analytics landscape
7
Rows: analytics complexity. Columns: data type (Structured, Semi-structured, Graph, Text, Multi-media).
Data entry and retrieval
 Structured: OLTP (online transactional processing)
 Semi-structured: key-value/document stores
 Graph: graph databases
 Text: keyword search
 Multi-media: …
Descriptive summaries
 Structured and semi-structured: SQL-on-Hadoop*: OLAP (online analytical processing)
 Graph: degree distribution, clustering coefficient distribution
 Text: word cloud
 Multi-media: …
Pattern discovery (data mining)
 Structured and semi-structured: DM on big data: frequent pattern mining, anomaly detection, clustering
 Graph: graph processing: graph clustering, influence analysis
 Text: topic modeling, sentiment analysis
 Multi-media: …
Predictive analytics (machine learning)
 All data types: ML on big data: regression, classification, recommendation, link prediction
Background on traditional SQL processing
 OLTP (online transactional processing) vs OLAP (online analytical processing)
 Specialized OLTP and OLAP systems connected by the ETL (extract, transform,
load) process
10
OLTP
 Purpose: Data entry and retrieval
 Queries: Simple read, insert, update and delete
 Speed: Real-time (low latency and high throughput)
OLAP
 Purpose: BI (business intelligence) or reporting
 Queries: More complex analytical and ad hoc queries (mostly optimized for read)
 Speed: Interactive
Figure: Transactions go to the OLTP System; Analytic Queries go to the OLAP System, an EDW (enterprise data warehouse); ETL / Replication carries data from the OLTP System to the OLAP System
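To make the OLTP vs OLAP contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names and rows are illustrative assumptions, and a single in-memory SQLite database only stands in for what would be two specialized systems.

```python
import sqlite3

# One in-memory database stands in for both systems (an assumption made for brevity).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (pid INTEGER, visit_date TEXT, reason TEXT)")

# OLTP-style work: short transactions that write or read a few rows by key.
with conn:
    conn.executemany(
        "INSERT INTO visits VALUES (?, ?, ?)",
        [
            (1, "2016-03-15", "Fever"),
            (2, "2016-10-20", "Headache"),
            (1, "2017-02-08", "Fever"),
            (3, "2017-06-18", "Cold"),
        ],
    )
point_lookup = conn.execute(
    "SELECT visit_date, reason FROM visits WHERE pid = ?", (2,)
).fetchall()

# OLAP-style work: a read-only query that scans and aggregates the whole table.
visits_per_reason = conn.execute(
    "SELECT reason, COUNT(*) FROM visits GROUP BY reason "
    "ORDER BY COUNT(*) DESC, reason"
).fetchall()

print(point_lookup)
print(visits_per_reason)
```

The point lookup touches one patient's rows, while the aggregate has to read every row, which is why the two workloads are usually served by differently organized systems.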
Why SQL-on-Hadoop?
 SQL (Structured Query Language) is the de facto language for transactional
and decision support systems and BI tools
 Healthcare analysts and hospital IT experts are very familiar with SQL
 SQL-on-Hadoop eases the transition to big data
 Little or no change to existing BI tools and applications
 SQL-on-Hadoop overcomes some shortcomings of conventional EDWs
 Scalability & fault tolerance
 Better support for semi-structured data
 Directly work on raw data (query in situ) by avoiding ETL
11
SQL-on-Hadoop Landscape
12
Figure: SQL-on-Hadoop systems. Category labels: Open Data, Proprietary Data, SQL Layer, Remote Query, MPP Query Engine. Systems named: Impala, Big SQL, PolyBase, Vortex, SQL-H, Spark SQL, dashDB
Technical Challenge
 How to distribute data and computation in a large cluster of machines for performance
 Bottleneck: transferring large volumes of data across the network
 Example: join (combining columns from multiple tables)
13
Clinical Visits
PID  VisitDate   Reason
1    2016-03-15  Fever
2    2016-10-20  Headache
1    2017-02-08  Fever
3    2017-06-18  Cold

Patient Info
PID  Name        DOB         Sex
1    Jim Green   1980-04-15  M
2    Alice Lee   1965-11-11  F
3    Rose Darcy  2001-07-21  F

Join result (on PID)
PID  VisitDate   Reason    Name        DOB         Sex
1    2016-03-15  Fever     Jim Green   1980-04-15  M
2    2016-10-20  Headache  Alice Lee   1965-11-11  F
1    2017-02-08  Fever     Jim Green   1980-04-15  M
3    2017-06-18  Cold      Rose Darcy  2001-07-21  F
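The join above can be reproduced on a single machine with a few lines of SQL. This is a minimal sketch using Python's built-in sqlite3 module; the lowercase table and column names are my own choices, and the rows are the ones shown in the two tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (pid INTEGER, visit_date TEXT, reason TEXT);
    CREATE TABLE patients (pid INTEGER PRIMARY KEY, name TEXT, dob TEXT, sex TEXT);
""")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        (1, "2016-03-15", "Fever"),
        (2, "2016-10-20", "Headache"),
        (1, "2017-02-08", "Fever"),
        (3, "2017-06-18", "Cold"),
    ],
)
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?, ?)",
    [
        (1, "Jim Green", "1980-04-15", "M"),
        (2, "Alice Lee", "1965-11-11", "F"),
        (3, "Rose Darcy", "2001-07-21", "F"),
    ],
)

# Combine the columns of both tables for rows that share the same patient ID.
joined = conn.execute("""
    SELECT v.pid, v.visit_date, v.reason, p.name, p.dob, p.sex
    FROM visits AS v
    JOIN patients AS p ON v.pid = p.pid
    ORDER BY v.visit_date
""").fetchall()

for row in joined:
    print(row)
```

On one machine the join is cheap. The technical challenge arises when the two tables are spread over many machines and matching rows sit on different nodes.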
SQL-on-Hadoop Strategies (1/2)
 Storing data in formats that are easy for query processing
 Columnar data formats (Parquet, ORCFile)
 Pushing analytics close to the data
 Intelligent data readers (apply predicates and projections while reading the data)
 Carefully choosing the algorithm and what data to transfer for each analytics operation
 E.g. how to choose from different join algorithms based on data characteristics
14
Figure: two ways to join the blue table (B) with the green table (G)
Broadcast smaller table: network cost 2|G|
VS
Repartition both tables: network cost 2/3|B| + 2/3|G|
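The two cost formulas in the figure are consistent with a 3-node cluster in which both tables start out evenly spread across the nodes. Under that assumption (my reading, not stated on the slide), a rough cost model for choosing between the two join algorithms might look like the sketch below. The function names are hypothetical.

```python
from fractions import Fraction


def broadcast_cost(small_size, nodes):
    """Rows shipped when every node must end up with the full smaller table.

    Assumption: each node already holds 1/nodes of the table, so each
    fragment is sent to the other (nodes - 1) nodes.
    """
    return (nodes - 1) * small_size


def repartition_cost(size_b, size_g, nodes):
    """Rows shipped when both tables are re-hashed on the join key.

    Assumption: uniform hashing, so a row is already on its target node
    with probability 1/nodes and must move otherwise.
    """
    return Fraction(nodes - 1, nodes) * (size_b + size_g)


def choose_join(size_b, size_g, nodes):
    """Pick the strategy with the lower estimated network cost."""
    broadcast = broadcast_cost(min(size_b, size_g), nodes)
    repartition = repartition_cost(size_b, size_g, nodes)
    if broadcast <= repartition:
        return ("broadcast", broadcast)
    return ("repartition", repartition)


# With 3 nodes the formulas reduce to 2|G| and 2/3|B| + 2/3|G|.
print(choose_join(900, 30, 3))   # small G: broadcasting is cheaper
print(choose_join(900, 600, 3))  # G close to B in size: repartitioning is cheaper
```

Under this model with 3 nodes, broadcasting wins whenever the smaller table is less than half the size of the larger one.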
SQL-on-Hadoop Strategies (2/2)
 Pre-process data into better organization for queries
 Hash or range-based data partitioning and bucketing
 Auxiliary data structures for eliminating unnecessary data access
 Indexing and synopsis
 Better data placement for related data
 E.g. collocating related data together on HDFS (Hadoop distributed file system)
15
Figure: two placements of partitioned tables across nodes 1, 2 and 3
Co-partition: network cost |G|
Co-partition and co-location: network cost 0
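A minimal sketch of why co-partitioning plus co-location brings the network cost to zero: if both tables are partitioned on the join key with the same function and matching partitions are stored on the same node, every node can join its own partitions without shipping any rows. The partition count, the modulo partitioner (standing in for a hash function) and the helper names are assumptions for illustration.

```python
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical number of nodes


def partition_by_key(rows, num_partitions=NUM_PARTITIONS):
    """Assign each row to a partition using its first field (the join key)."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[0] % num_partitions].append(row)
    return parts


def local_join(left_part, right_part):
    """Hash join of two partitions that live on the same node."""
    index = defaultdict(list)
    for right_row in right_part:
        index[right_row[0]].append(right_row)
    return [
        left_row + right_row[1:]
        for left_row in left_part
        for right_row in index.get(left_row[0], [])
    ]


visits = [
    (1, "2016-03-15", "Fever"),
    (2, "2016-10-20", "Headache"),
    (1, "2017-02-08", "Fever"),
    (3, "2017-06-18", "Cold"),
]
patients = [(1, "Jim Green"), (2, "Alice Lee"), (3, "Rose Darcy")]

visit_parts = partition_by_key(visits)
patient_parts = partition_by_key(patients)

# Partition p of both tables is assumed to sit on node p, so no row moves.
joined = []
for p in range(NUM_PARTITIONS):
    joined.extend(local_join(visit_parts[p], patient_parts[p]))

print(sorted(joined))
```

If the tables were co-partitioned but stored on different nodes, the same local joins would work only after shipping one table's partitions to the other's nodes, which is the |G| cost in the figure.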
Machine learning on big data
 SQL analytics tools are not enough to capture the full value of big data
 Big data impact on ML (machine learning):
 Opportunities:
 More training data → better predictions
 We can train a model with billions of parameters, because we have sufficiently big data
 Making deep learning possible!
 Challenges:
 Scalability and distributed computing
 A big learning curve for data scientists
17
Machine Learning Deep Learning
Big ML systems landscape
18
Different levels of abstraction for big ML systems
 ML libraries
 E.g. Spark MLlib, H2O, IBM Watson
 Provide a list of parameterized ML algorithms
 Declarative ML
 E.g. SystemML, Mahout
 Expose an R- or MATLAB-like language to users
 Primitive: linear algebra and math operations
 Cost-based optimizer to compile execution plans
 Also provide a library of ML algorithms
 AutoML
 E.g. H2O
 Automate the process of training a large selection of candidate models
19
Figure: SystemML architecture. Layers: Language, Compiler, Runtime. Runtime targets: In-Memory Single Node (scale-up), Hadoop or Spark Cluster (scale-out)
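To illustrate what "primitive: linear algebra and math operations" means in practice, here is a minimal sketch of linear regression written only in terms of matrix-vector operations. It is plain Python, not SystemML's language, and the helper names, learning rate and toy data are assumptions. In a declarative ML system the user would write roughly the update rule, and the compiler would decide how to execute each operation on a single node or a cluster.

```python
def matvec(matrix, vector):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def linreg_gd(X, y, learning_rate=0.1, iterations=2000):
    """Least-squares fit by gradient descent: w <- w - lr * X^T (X w - y) / n."""
    n, d = len(X), len(X[0])
    Xt = transpose(X)
    w = [0.0] * d
    for _ in range(iterations):
        residual = [pred - target for pred, target in zip(matvec(X, w), y)]
        gradient = [g / n for g in matvec(Xt, residual)]
        w = [w_j - learning_rate * g_j for w_j, g_j in zip(w, gradient)]
    return w


# Toy data generated from y = 1 + 2x; the first column of X is the intercept term.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w = linreg_gd(X, y)
print(w)
```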
Graph analytics on big data
 Graphs provide a powerful primitive for modeling real-world objects and the
relationships between objects
 Patient-patient/doctor-patient interactions, biological pathways, protein
interaction networks, ontologies, knowledge graphs, etc
 Two types:
 Graph databases: focus on real-time graph analytics
 Graph processing systems: focus on batch processing of graphs
21
Graph databases
 Real-time graph analytics
 Updates, simple node and edge retrieval
 Pattern matching queries
 Given a graph pattern, find subgraphs in the database graphs that (exactly or
approximately) match the query
 Example: find out what biological processes are affected by a disease
 Querying a disease pathway against a database of known pathways
22
Figure labels: Graph Databases; SAGA (query against a database of pathways)
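A minimal sketch of an exact pattern matching query, using simple backtracking over a small directed graph. This is not how SAGA or any particular graph database implements matching, it covers only exact (not approximate) matches, and the graph, pattern and function name are made up for illustration.

```python
def match_pattern(pattern_edges, graph_edges):
    """Return all injective mappings of pattern nodes to graph nodes
    under which every pattern edge is also an edge of the graph."""
    graph = set(graph_edges)
    graph_nodes = sorted({n for edge in graph_edges for n in edge})
    pattern_nodes = sorted({n for edge in pattern_edges for n in edge})
    matches = []

    def consistent(mapping):
        # Check every pattern edge whose two endpoints are already mapped.
        return all(
            (mapping[a], mapping[b]) in graph
            for a, b in pattern_edges
            if a in mapping and b in mapping
        )

    def extend(mapping):
        if len(mapping) == len(pattern_nodes):
            matches.append(dict(mapping))
            return
        node = pattern_nodes[len(mapping)]
        for candidate in graph_nodes:
            if candidate in mapping.values():
                continue  # keep the mapping injective
            mapping[node] = candidate
            if consistent(mapping):
                extend(mapping)
            del mapping[node]

    extend({})
    return matches


# Hypothetical pathway graph and a two-step chain pattern x -> y -> z.
pathway = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
pattern = [("x", "y"), ("y", "z")]
matches = match_pattern(pattern, pathway)
for m in matches:
    print(m)
```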
Graph processing systems
 Batch graph analytics
 Long running (usually iterative) analysis on the entire graph
 E.g. PageRank algorithm to identify key influencers of a disease propagation
network
 Performance bottleneck: network overhead
 Better graph partitioning and absorbing messages within a partition
 Combining messages (when messages can be aggregated)
23
Figure labels: Graph Processing; Microsoft Graph Engine
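The message-combining idea can be sketched with a small PageRank loop. Contributions headed for the same destination vertex are summed inside the sending partition, so only one combined message per (source partition, destination vertex) pair crosses the network. This is a single-process simulation under my own assumptions (graph, partition assignment, function name), and rank held by vertices without outgoing edges is simply dropped.

```python
from collections import defaultdict


def pagerank_with_combiner(edges, partition_of, num_iters=50, damping=0.85):
    nodes = sorted({n for edge in edges for n in edge})
    out_edges = defaultdict(list)
    for src, dst in edges:
        out_edges[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    network_messages = 0
    for _ in range(num_iters):
        # Combine contributions per (source partition, destination vertex).
        combined = defaultdict(float)
        for src in nodes:
            targets = out_edges[src]
            if not targets:
                continue  # simplification: dangling vertices send nothing
            share = rank[src] / len(targets)
            for dst in targets:
                combined[(partition_of[src], dst)] += share

        incoming = defaultdict(float)
        for (src_part, dst), value in combined.items():
            if src_part != partition_of[dst]:
                network_messages += 1  # only cross-partition messages use the network
            incoming[dst] += value

        rank = {
            n: (1 - damping) / len(nodes) + damping * incoming[n] for n in nodes
        }
    return rank, network_messages


# Hypothetical propagation network; vertex c is placed alone on partition 1.
edges = [("a", "c"), ("b", "c"), ("c", "a"), ("c", "b"), ("d", "c")]
partition_of = {"a": 0, "b": 0, "c": 1, "d": 0}
rank, network_messages = pagerank_with_combiner(edges, partition_of)
print(max(rank, key=rank.get), network_messages)
```

In this toy graph the three messages sent to c from partition 0 are combined into one, so each iteration sends 3 messages over the network instead of 5.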
Integrated analytics
 An application often requires different types of analytics together
 E.g. SQL is often used to prepare the data for ML
 An example: Medtronic & IBM Watson Health Partnership
 “gathers a patient’s readings from Medtronic insulin pumps and glucose monitors,
and combines them with information taken from the individual’s activity trackers
and diet. The system uses pattern recognition gleaned through IBM’s Watson to
provide feedback on how a patient can manage their diabetes”
 “Medtronic's insulin pumps using Watson artificial intelligence (AI) could warn
patients of abnormally low blood sugar levels up to three hours in advance”
25
References:
https://www.meddeviceonline.com/doc/ibm-watson-to-power-medtronic-s-diabetes-app-under-armour-s-fitness-app-0001
Solutions for Integrated analytics
 Integrating existing analytics systems
 Data transformation: transform the data format between different systems
 Data transfer: transfer the output of one system to another system
 Building a single system for various types of analytics
 E.g. Spark, Wildfire (IBM Project EventStore)
26
Figure labels: Spark, Wildfire, OLAP, OLTP, ML, Stream, Batch GA, Real Time GA, Shared Storage
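When existing systems are integrated, the data transformation step is often a small piece of glue code. The sketch below uses SQL to summarize readings per patient and then reshapes the result rows into the feature-matrix layout an ML library would typically expect. The table, the readings and the variable names are made up for illustration, and no particular ML system is implied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (pid INTEGER, glucose REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(1, 90.0), (1, 110.0), (2, 150.0), (2, 170.0), (3, 100.0)],  # toy values
)

# Step 1 (SQL system): one descriptive summary row per patient.
rows = conn.execute(
    "SELECT pid, AVG(glucose), COUNT(*) FROM readings GROUP BY pid ORDER BY pid"
).fetchall()

# Step 2 (transformation): turn SQL rows into row identifiers plus feature vectors.
patient_ids = [pid for pid, _, _ in rows]
features = [[avg_glucose, float(count)] for _, avg_glucose, count in rows]

print(patient_ids)
print(features)
```

In a single integrated system this hand-off would happen inside the engine, without exporting and reshaping the data between systems.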
Conclusion
 Big data analytics comes in different forms
 What types of data do you have?
 What level of complexity does the analytics require?
 What is the latency requirement?
 An application often requires different types of analytics together
 What types of analytics do you need to integrate?
 What is your performance requirement?
 Do you need to integrate existing analytics pipelines, or can you start with a
single system that supports all analytics?
27
Big Data Analytics: From SQL to Machine Learning and Graph Analysis

  • 1. © 2017 IBM Corporation Big Data Analytics: From SQL to Machine Learning and Graph Analysis Yuanyuan Tian IBM Research -- Almaden Keynote for KDD bigdas 2017
  • 2. A bit about me  I am a computer scientist who builds data management and analytics systems  My talk is from the perspective of a big data analytics system builder  I have some exposure to healthcare domain data and analytics problems by collaborating with experts in IBM Watson Health division 2
  • 3. What is big data?  Gartner’s 3Vs definition:  “Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”  Extra Vs  Variability, Veracity, Visualization, Value  How big is big data?  It is all relative  It is always a moving definition  It is not all about the size  My answer: when conventional data management and analytics tools are inadequate = big data 3 Figure from https://www.linguamatics.com/blog/big-data-real-world-data- where-does-text-analytics-fit Big Data 3Vs
  • 4. Why is big data important for health care?  Large volumes of data  eHealth  mHealth  Sensor & wearable technologies  Genome sequencing  New applications  Personalized medicine  Clinical risk intervention  Predictive analytics 4 Big Data
  • 5. Big data analytics  Big data analytics comes in different forms! 5
  • 6. Two dimensions of big data analytics  Data type  Structured data  Records in relational database tables  Semi-structured data  Json and XML  Unstructured data  Text data  Graph data  Social and interaction data  Multi-media data  Images and videos  Complexity of analytics  Data entry and retrieval  Look for a patient’s EHR at check in  Descriptive summaries  Compute the number of outbreaks across different geo regions  Pattern discovery (data mining)  Identify unusual patterns of medical claims by clinics, physicians, labs, etc  Predictive analytics (machine learning)  Predict a patient’s readmission to the hospital 6
  • 7. Big data analytics landscape 7 Structured Semi-structured Graph Text Multi-media Date entry and retrieval OLTP (online transactional processing) key-value/document stores graph databases keyword search … Descriptive summaries SQL-on-Hadoop*: OLAP (online analytical processing) degree distribution, clustering coefficient distribution word cloud … Pattern discovery (data mining) DM on big data: frequent pattern mining, anomaly detection, clustering Graph processing: graph clustering, influence analysis topic modeling, sentiment analysis … Predictive analytics (machine learning) ML on big data: regression, classification, recommendation, link predication Data Type AnalyticsComplexity
  • 8. Big data analytics landscape 8 Structured Semi-structured Graph Text Multi-media Date entry and retrieval OLTP (online transactional processing) key-value/document stores graph databases keyword search … Descriptive summaries SQL-on-Hadoop*: OLAP (online analytical processing) degree distribution, clustering coefficient distribution word cloud … Pattern discovery (data mining) DM on big data: frequent pattern mining, anomaly detection, clustering Graph processing: graph clustering, influence analysis topic modeling, sentiment analysis … Predictive analytics (machine learning) ML on big data: regression, classification, recommendation, link predication Data Type AnalyticsComplexity
  • 9. Big data analytics landscape 9 Structured Semi-structured Graph Text Multi-media Date entry and retrieval OLTP (online transactional processing) key-value/document stores graph databases keyword search … Descriptive summaries SQL-on-Hadoop*: OLAP (online analytical processing) degree distribution, clustering coefficient distribution word cloud … Pattern discovery (data mining) DM on big data: frequent pattern mining, anomaly detection, clustering Graph processing: graph clustering, influence analysis topic modeling, sentiment analysis … Predictive analytics (machine learning) ML on big data: regression, classification, recommendation, link predication Data Type AnalyticsComplexity
  • 10. Background on traditional SQL processing  OLTP (online transactional processing) vs OLAP (online analytical processing)  Specialized OLTP and OLAP systems connected by the ETL (extract, transform, load) process 10 Purpose Queries Speed OLTP Data entry and retrieval Simple read, insert, update and delete Real-time (low latency and high throughput) OLAP BI (business intelligence) or reporting More complex analytical and ad hoc queries (mostly optimized for read) Interactive Transactions Analytic Queries ETL / Replication OLTP System OLAP System EDW (enterprise data warehouse)
  • 11. Why SQL-on-Hadoop?
      ◦ SQL (Structured Query Language) is the de facto language for transactional and decision support systems and BI tools
          ▪ Healthcare analysts and hospital IT experts are very familiar with SQL
      ◦ SQL-on-Hadoop eases the transition to big data
          ▪ Little or no change to existing BI tools and applications
      ◦ SQL-on-Hadoop overcomes some shortcomings of conventional EDWs
          ▪ Scalability & fault tolerance
          ▪ Better support for semi-structured data
          ▪ Directly work on raw data (query in situ) by avoiding ETL
  • 12. SQL-on-Hadoop Landscape [Figure: SQL-on-Hadoop systems grouped under the labels Open Data (SQL Layer, MPP Query Engine), Proprietary Data, and Remote Query. Systems shown include Impala, Big SQL, PolyBase, Vortex, SQL-H, Spark SQL, and dashDB.]
  • 13. Technical Challenge
      ◦ How to distribute data and computation in a large cluster of machines for performance
      ◦ Bottleneck: transferring large volumes of data across the network
      ◦ Example: join (combining columns from multiple tables)
      Clinical Visits:
        PID | VisitDate  | Reason
        1   | 2016-03-15 | Fever
        2   | 2016-10-20 | Headache
        1   | 2017-02-08 | Fever
        3   | 2017-06-18 | Cold
      Patient Info:
        PID | Name       | BOD        | Sex
        1   | Jim Green  | 1980-04-15 | M
        2   | Alice Lee  | 1965-11-11 | F
        3   | Rose Darcy | 2001-07-21 | F
      Clinical Visits joined with Patient Info on PID:
        PID | VisitDate  | Reason   | Name       | BOD        | Sex
        1   | 2016-03-15 | Fever    | Jim Green  | 1980-04-15 | M
        2   | 2016-10-20 | Headache | Alice Lee  | 1965-11-11 | F
        1   | 2017-02-08 | Fever    | Jim Green  | 1980-04-15 | M
        3   | 2017-06-18 | Cold     | Rose Darcy | 2001-07-21 | F
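The join on this slide can be reproduced on a single machine with a small hash join. The sketch below is a plain Python illustration, not the implementation of any particular SQL-on-Hadoop engine; the function name `hash_join` is an assumption, and the column name `BOD` is kept exactly as it appears on the slide.

```python
from collections import defaultdict

clinical_visits = [
    {"PID": 1, "VisitDate": "2016-03-15", "Reason": "Fever"},
    {"PID": 2, "VisitDate": "2016-10-20", "Reason": "Headache"},
    {"PID": 1, "VisitDate": "2017-02-08", "Reason": "Fever"},
    {"PID": 3, "VisitDate": "2017-06-18", "Reason": "Cold"},
]
patient_info = [
    {"PID": 1, "Name": "Jim Green", "BOD": "1980-04-15", "Sex": "M"},
    {"PID": 2, "Name": "Alice Lee", "BOD": "1965-11-11", "Sex": "F"},
    {"PID": 3, "Name": "Rose Darcy", "BOD": "2001-07-21", "Sex": "F"},
]


def hash_join(left, right, key):
    """Join two lists of dict rows on `key` using an in-memory hash table."""
    # Build phase: index the right-hand table by join key.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    # Probe phase: look up each left row and merge the matching columns.
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


result = hash_join(clinical_visits, patient_info, "PID")
for row in result:
    print(row)
```

On one machine both tables are in local memory. In a cluster the two inputs are spread over many nodes, so rows with the same PID first have to be brought to the same node, and that data movement is the network bottleneck the slide refers to.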
  • 14. SQL-on-Hadoop Strategies (1/2)
      ◦ Storing data in formats that are easy for query processing
          ▪ Columnar data formats (Parquet, ORCFile)
      ◦ Pushing analytics close to the data
          ▪ Intelligent data readers (apply predicates and projections while reading the data)
      ◦ Carefully choosing the algorithm and what data to transfer for each analytics operation
          ▪ E.g. how to choose from different join algorithms based on data characteristics
      [Figure: joining a blue table (B) with a green table (G). Broadcast smaller table: network cost 2|G|. Repartition both tables: network cost 2/3|B| + 2/3|G|.]
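The two network costs in the figure, 2|G| and 2/3|B| + 2/3|G|, appear to correspond to a three-node cluster. A rough cost model that generalizes them to `nodes` machines is sketched below. The generalization, the uniform-hashing assumption, and all function names are assumptions made for illustration; a real optimizer would also account for CPU, disk, and skew.

```python
def broadcast_cost(small_size, nodes):
    """Data shipped when each node sends its share of the small table to every other node."""
    return (nodes - 1) * small_size


def repartition_cost(size_b, size_g, nodes):
    """Data shipped when both tables are hash-repartitioned on the join key.

    Assumes uniform hashing, so a row already sits on its target node
    with probability 1 / nodes and is shipped otherwise.
    """
    return (nodes - 1) / nodes * (size_b + size_g)


def choose_join(size_b, size_g, nodes):
    """Pick the strategy with the lower estimated network cost."""
    b_cost = broadcast_cost(min(size_b, size_g), nodes)
    r_cost = repartition_cost(size_b, size_g, nodes)
    if b_cost <= r_cost:
        return "broadcast", b_cost
    return "repartition", r_cost


# A large table joined with a small one, then two tables of similar size.
print(choose_join(1000, 10, 3))
print(choose_join(100, 100, 3))
```

With three nodes the model reduces to the figure's formulas. Broadcasting wins when one table is much smaller than the other, and repartitioning wins when the tables are of comparable size.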
  • 15. SQL-on-Hadoop Strategies (2/2)
      ◦ Pre-process data into better organization for queries
          ▪ Hash or range-based data partitioning and bucketing
      ◦ Auxiliary data structures for eliminating unnecessary data access
          ▪ Indexing and synopsis
      ◦ Better data placement for related data
          ▪ E.g. collocating related data together on HDFS (Hadoop distributed file system)
      [Figure: Co-partition: network cost |G|. Co-partition and co-location: network cost 0.]
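Co-partitioning can be illustrated with a short Python sketch: both tables are hash-partitioned on the join key with the same function, so partition i of one table only ever needs to meet partition i of the other. The helper names and the choice of three partitions are assumptions for illustration, and the lists stand in for partitions that would live on different nodes.

```python
def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition by hashing its join key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def local_join(left_rows, right_rows, key):
    """Join two partitions that are already on the same node."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left_rows for r in index.get(l[key], [])]


def co_partitioned_join(left, right, key, num_partitions):
    """Join partition i of `left` only with partition i of `right`."""
    left_parts = hash_partition(left, key, num_partitions)
    right_parts = hash_partition(right, key, num_partitions)
    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        joined.extend(local_join(left_part, right_part, key))
    return joined


visits = [
    {"PID": 1, "VisitDate": "2016-03-15"},
    {"PID": 2, "VisitDate": "2016-10-20"},
    {"PID": 1, "VisitDate": "2017-02-08"},
    {"PID": 3, "VisitDate": "2017-06-18"},
]
patients = [
    {"PID": 1, "Name": "Jim Green"},
    {"PID": 2, "Name": "Alice Lee"},
    {"PID": 3, "Name": "Rose Darcy"},
]
joined = co_partitioned_join(visits, patients, "PID", num_partitions=3)
print(joined)
```

If matching partitions are also stored on the same node (co-location), each `local_join` call needs no network transfer at all, which is the zero-cost case in the figure.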
  • 17. Machine learning on big data
      ◦ SQL analytics tools are not enough to capture the full value of big data
      ◦ Big data impact on ML (machine learning):
          ▪ Opportunities:
              · More training data → better predictions
              · We can train a model with billions of parameters, because we have sufficiently big data
              · Making deep learning possible!
          ▪ Challenges:
              · Scalability and distributed computing
              · A big learning curve for data scientists
  • 18. Big ML systems landscape [Figure: systems grouped into Machine Learning and Deep Learning]
  • 19. Different levels of abstraction for big ML systems
      ◦ ML libraries
          ▪ E.g. Spark MLlib, H2O, IBM Watson
          ▪ Provide a list of parameterized ML algorithms
      ◦ Declarative ML
          ▪ E.g. SystemML, Mahout
          ▪ Expose an R- or Matlab-like language to users
          ▪ Primitive: linear algebra and math operations
          ▪ Cost-based optimizer to compile execution plans
          ▪ Also provide a library of ML algorithms
      ◦ AutoML
          ▪ E.g. H2O
          ▪ Automate the process of training a large selection of candidate models
      [Figure: SystemML with Language, Compiler, and Runtime layers; runtime targets are an in-memory single node (scale-up) and a Hadoop or Spark cluster (scale-out)]
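To give a feel for what "linear algebra and math operations as the primitive" means, here is a minimal sketch of linear regression by gradient descent written only in terms of matrix-vector products. It is plain Python, not SystemML's language or Mahout's API; the helper names, the step size, and the toy data are all assumptions. In a declarative ML system, the user would write the loop body at this level and the optimizer would decide how to execute each operation on the cluster.

```python
def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    """Transpose a list-of-rows matrix."""
    return [list(column) for column in zip(*matrix)]


def linreg_gd(X, y, step=0.1, iterations=2000):
    """Least-squares linear regression via batch gradient descent."""
    n = len(X)
    w = [0.0] * len(X[0])
    Xt = transpose(X)
    for _ in range(iterations):
        # residual = X w - y
        residual = [p - t for p, t in zip(matvec(X, w), y)]
        # gradient = X^T residual / n
        gradient = [g / n for g in matvec(Xt, residual)]
        # w = w - step * gradient
        w = [w_j - step * g_j for w_j, g_j in zip(w, gradient)]
    return w


# Toy data generated from y = 1 + 2x; the first column is the intercept term.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
weights = linreg_gd(X, y)
print(weights)
```

An ML library would instead hide this loop behind a single parameterized call, which is convenient but leaves no way to change the algorithm's internals.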
  • 21. Graph analytics on big data
      ◦ Graphs provide a powerful primitive for modeling real-world objects and the relationships between objects
          ▪ Patient-patient/doctor-patient interactions, biological pathways, protein interaction networks, ontologies, knowledge graphs, etc.
      ◦ Two types:
          ▪ Graph databases: focus on real-time graph analytics
          ▪ Graph processing systems: focus on batch processing of graphs
  • 22. Graph databases
      ◦ Real-time graph analytics
          ▪ Updates, simple node and edge retrieval
          ▪ Pattern matching queries
              · Given a graph pattern, find subgraphs in the database graphs that (exactly or approximately) match the query
      ◦ Example: find out what biological processes are affected by a disease
          ▪ Querying a disease pathway against a database of known pathways
      [Figure: graph database systems; SAGA (query against a database of pathways)]
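A pattern matching query can be sketched as a small backtracking search. The code below does exact matching on node labels and directed edges only; it is not SAGA's approximate matching algorithm and not the query engine of any graph database. The tiny graph, its labels, and the function name `find_matches` are hypothetical.

```python
def find_matches(pattern_labels, pattern_edges, graph_labels, graph_edges):
    """Return every mapping of pattern nodes to distinct graph nodes that
    preserves node labels and all directed pattern edges."""
    graph_edge_set = set(graph_edges)
    pattern_nodes = sorted(pattern_labels)
    matches = []

    def extend(mapping):
        if len(mapping) == len(pattern_nodes):
            matches.append(dict(mapping))
            return
        p = pattern_nodes[len(mapping)]
        for g in graph_labels:
            if g in mapping.values() or graph_labels[g] != pattern_labels[p]:
                continue
            mapping[p] = g
            # Check every pattern edge whose two endpoints are already mapped.
            consistent = all(
                (mapping[a], mapping[b]) in graph_edge_set
                for a, b in pattern_edges
                if a in mapping and b in mapping
            )
            if consistent:
                extend(mapping)
            del mapping[p]

    extend({})
    return matches


# Hypothetical data: one disease linked to two genes, one of which
# is linked to a biological process.
graph_labels = {"d1": "disease", "g1": "gene", "g2": "gene", "p1": "process"}
graph_edges = [("d1", "g1"), ("d1", "g2"), ("g1", "p1")]

# Pattern: disease -> gene -> process.
pattern_labels = {"D": "disease", "G": "gene", "P": "process"}
pattern_edges = [("D", "G"), ("G", "P")]

matches = find_matches(pattern_labels, pattern_edges, graph_labels, graph_edges)
print(matches)
```

Approximate matching, as mentioned on the slide, would relax the label and edge checks and rank candidate subgraphs by a similarity score instead of rejecting them outright.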
  • 23. Graph processing systems
      ◦ Batch graph analytics
          ▪ Long running (usually iterative) analysis on the entire graph
          ▪ E.g. PageRank algorithm to identify key influencers of a disease propagation network
      ◦ Performance bottleneck: network overhead
          ▪ Better graph partitioning and absorbing messages within a partition
          ▪ Combining messages (when messages can be aggregated)
      [Figure: graph processing systems, including Microsoft Graph Engine]
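The iterative style of batch graph analytics, and the idea of combining messages, can be sketched with a single-machine PageRank. In the sketch below, the `incoming` dictionary plays the role of a combiner: contributions headed to the same destination are summed into one value instead of being kept as one message per edge. The damping factor of 0.85, the iteration count, the handling of nodes without outgoing edges, and the toy edge list are assumptions for illustration.

```python
def pagerank(edges, num_iterations=50, damping=0.85):
    """Iterative PageRank over a list of directed (source, destination) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(num_iterations):
        # Combiner: one summed value per destination, not one message per edge.
        incoming = {node: 0.0 for node in nodes}
        dangling = 0.0
        for node in nodes:
            if out_links[node]:
                share = rank[node] / len(out_links[node])
                for dst in out_links[node]:
                    incoming[dst] += share
            else:
                # Rank of nodes without outgoing edges is spread evenly.
                dangling += rank[node]
        rank = {
            node: (1 - damping) / len(nodes)
            + damping * (incoming[node] + dangling / len(nodes))
            for node in nodes
        }
    return rank


# Toy graph: a and b both point to c, and c points back to a.
ranks = pagerank([("a", "c"), ("b", "c"), ("c", "a")])
print(ranks)
```

In a distributed system, summing contributions before they leave a partition means only one number per destination crosses the network, which is why combining helps whenever the messages can be aggregated.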
  • 25. Integrated analytics
      ◦ An application often requires different types of analytics together
          ▪ E.g. SQL is often used to prepare the data for ML
      ◦ An example: Medtronic & IBM Watson Health Partnership
          ▪ "gathers a patient's readings from Medtronic insulin pumps and glucose monitors, and combines them with information taken from the individual's activity trackers and diet. The system uses pattern recognition gleaned through IBM's Watson to provide feedback on how a patient can manage their diabetes"
          ▪ "Medtronic's insulin pumps using Watson artificial intelligence (AI) could warn patients of abnormally low blood sugar levels up to three hours in advance"
      References: https://www.meddeviceonline.com/doc/ibm-watson-to-power-medtronic-s-diabetes-app-under-armour-s-fitness-app-0001
  • 26. Solutions for Integrated analytics
      ◦ Integrating existing analytics systems
          ▪ Data transformation: transform the data format between different systems
          ▪ Data transfer: transfer the output of one system to another system
      ◦ Building a single system for various types of analytics
          ▪ E.g. Spark, Wildfire (IBM Project EventStore)
      [Figure labels: Spark, Wildfire, OLAP, OLTP, ML, Stream, Batch, Real Time, GA, Shared Storage]
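The first option, gluing existing systems together, can be sketched as a three-step pipeline: a SQL step that prepares features, a transformation step that reshapes the SQL output, and an ML step that consumes it. The sketch uses the standard-library `sqlite3` module and a hand-written least-squares fit; the table `readings`, its columns, and all values are hypothetical and carry no clinical meaning.

```python
import sqlite3

# SQL system (stand-in): raw readings, several per day. Values are made up.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE readings (pid INTEGER, day INTEGER, glucose REAL);
    INSERT INTO readings VALUES
        (1, 1, 100.0), (1, 1, 110.0),
        (1, 2, 115.0), (1, 2, 125.0),
        (1, 3, 130.0), (1, 3, 140.0);
    """
)

# SQL step: aggregate raw readings into one feature row per day.
daily = conn.execute(
    "SELECT day, AVG(glucose) FROM readings WHERE pid = 1 "
    "GROUP BY day ORDER BY day"
).fetchall()

# Data transformation and transfer step: SQL rows become plain numeric lists.
xs = [float(day) for day, _ in daily]
ys = [avg for _, avg in daily]

# ML step: fit y = slope * x + intercept by ordinary least squares.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x
predicted_day4 = slope * 4 + intercept
print(predicted_day4)
```

At scale, the middle step is where the cost of integration shows up: the SQL output has to be converted and shipped to the ML system, which is the overhead a single integrated system tries to avoid.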
  • 27. Conclusion
      ◦ Big data analytics comes in different forms
          ▪ What types of data do you have?
          ▪ What level of complexity does the analytics require?
          ▪ What is the latency requirement?
      ◦ An application often requires different types of analytics together
          ▪ What types of analytics do you need to integrate?
          ▪ What is your performance requirement?
          ▪ Do you need to integrate existing analytics pipelines, or can you start with a single system that supports all analytics?

Editor's notes

  1. I will try to provide a roadmap in this talk to help you navigate the big data analytics landscape. I am not a healthcare domain expert; however, I have some exposure to healthcare domain data and analytics problems, and I have been collaborating with experts in the Watson Health division to formulate the talk.
  2. The first question before we talk about big data analytics is: what is big data? The most popular definition is the 3Vs definition from Gartner. Over the years, others have extended the definition of big data with more Vs. The next question people usually ask is: how do you know you have big data? How big is big data? Well, it is all relative, and with technology advances in storage and data processing, it is always a moving definition. Ten years ago, people thought 1 petabyte of data was huge; nowadays it is becoming very common, and people are starting to talk about exabytes and even zettabytes. As we have seen in the 3Vs definition, it is not all about size. So, how big is big data? There is no agreed-upon answer. My answer to this question is that you know you are dealing with big data when conventional data management and analytics tools are not enough. Volume: the quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not. Variety: the type and nature of the data. This helps people who analyze it to effectively use the resulting insight. Velocity: the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Variability: inconsistency of the data set can hamper processes to handle and manage it. Veracity: the quality of captured data can vary greatly, affecting the accuracy of analysis.
  3. Large volumes of data are being accumulated in the healthcare domain, due to eHealth, mobile health, the wide use of sensor and wearable technologies, and advances in genome sequencing. In addition, a number of new healthcare applications are emerging because of big data, such as
  4. Big data analytics can mean different things to different people. For some people it is machine learning; for others it may be SQL analytics. This is not surprising, because big data analytics comes in different forms. In this talk, I will categorize big data analytics.
  5. I will categorize big data analytics along two dimensions. One dimension is the data type. The other dimension is the complexity of analytics, from simple to more complex. The simplest type of analytics does updates and data retrieval, for example, retrieving a patient's EHR record when she checks in at a hospital. The next type is creating descriptive summaries, which groups data and computes statistics. The next level goes beyond computing simple statistics to discovering patterns using data mining techniques, for example, for fraud detection, identifying unusual patterns of medical claims by clinics, physicians, labs, and so on. The last level is predictive analytics using machine learning techniques, for example, predicting whether a patient will be readmitted to the hospital based on historical data.
  6. Now here is the big data analytics landscape along the two dimensions. The horizontal dimension is data type, and the vertical dimension is analytics complexity. I am not familiar with image or video processing, so I am going to leave multi-media out. For structured data, data entry and retrieval is basically OLTP, and for semi-structured data, people use key-value/document stores, such as Cassandra or MongoDB, for data entry and retrieval. For graph data entry and retrieval, people use graph databases, like Neo4j and JanusGraph. For text data, people use search systems for keyword search. For both structured and semi-structured data, people use SQL-on-Hadoop systems for descriptive summaries. They basically do OLAP. Notice the star next to Hadoop? Here Hadoop is abused to represent big data; many SQL-on-Hadoop systems are not really using Hadoop underneath. For graphs, people basically compute …, and for text, the word cloud is the most widely used method to compute descriptive summaries. For structured and semi-structured data, people use data mining on big data for ...., and for graphs, people use graph processing systems for graph clustering, influence analysis, etc. Examples of pattern discovery on text are topic modeling and sentiment analysis. For predictive analytics, big ML systems are used for all the different types of data, but depending on the actual data types, you may need to do some data transformation to be able to use ML.
  7. Over the years, I have worked on a number of types of big data analytics. I will cover these types in this talk.
  9. Before I talk about SQL-on-Hadoop, I will briefly provide some background on traditional SQL processing. In traditional SQL processing, there are two types of workloads: OLTP and OLAP. The differences between them are listed in this table. Because OLTP and OLAP systems have very different characteristics, the database field has evolved to have specialized OLTP systems and OLAP systems. An ETL process is used to consolidate and transform transactional data from OLTP systems into OLAP systems. Name any application in use at a hospital or in a physician's office, and the chances are good that it runs on an OLTP database: EHRs, lab systems, financial systems, patient satisfaction systems, patient identification, billing and payment processing, etc.
  10. SQL (Structured Query Language) is the de facto language for transactional and decision support systems and BI tools to access and query a variety of data sources. Transitioning to big data requires a steep learning curve,
  11. SQL-on-Hadoop systems support data warehousing functionality on big data, i.e., they focus on OLAP queries. There are many SQL-on-Hadoop systems today. They can be categorized into several camps. The first camp supports querying existing data in open formats, so there is no lock-in. This camp can be further divided into two subgroups: the first subgroup just builds a simple SQL layer on existing data platforms, like Hive building on MapReduce and Spark SQL building on Spark, whereas the second subgroup builds an MPP query engine from scratch. The second subgroup typically has better performance. The second camp controls the storage layer and uses proprietary formats. The last camp extends existing EDWs to work with big data. Querying existing data in open formats vs. a controlled storage layer with proprietary formats? A SQL layer on top of existing big data systems (like MapReduce or Spark) or an MPP query engine architected from the ground up? Directly querying big data vs. going through an existing database?
  12. The major technical challenge for SQL-on-Hadoop systems is to distribute data and computation in… Quite often the major bottleneck is transferring large volumes of data across the network. Let's use the database join operator to illustrate this challenge. Join is a database operator that combines columns from multiple tables. For example, one can join the clinical visits and patient info tables on the patient ID. The join brings the records with the same PID together. In the big data setting, the two tables are partitioned and distributed across the cluster, so join processing needs to transfer data across the network to actually perform the query.
  13. Here are some strategies applied in many SQL-on-Hadoop systems to address the challenges. For example, in the past, I have worked on comparing different join algorithms for big data and provided guidelines on how to choose among join algorithms for a particular query based on data characteristics. One join strategy for joining a big table with a small table is broadcasting the smaller table to all machines in the cluster and then performing local joins on each machine. In this figure, I have two tables, a blue table and a green table, both distributed across the machines in the cluster. In this particular case, the green table is the smaller table, so I ship all the partitions of the green table to every node. The red arrows represent the network communication. In total, the algorithm sends 2 times the size of the green table across the network. There is another join strategy that is good for joining two large tables. This algorithm repartitions both tables and sends the corresponding partitions from both tables to one of the machines for processing. It ends up sending this much data across the network in total. As you can see, depending on the sizes of the two tables, one algorithm may be preferred for a particular join operation.
  14. Data partitioning means partitioning data based on some values, instead of partitioning it randomly. For example, when two tables are partitioned the same way on the join key, you only need to bring the corresponding partitions together for join processing. This often reduces the processing and network overhead. Finally, better data placement can often bring a significant performance boost. For example, in one of my works, I extended HDFS to support collocation of related data in a best-effort approach. Using this technique can significantly reduce the network overhead. In this example, not only are the two tables co-partitioned, but the corresponding partitions are also collocated, so when joining the two tables together, no network cost is incurred.
  15. We have talked a lot about SQL-on-Hadoop; let's now move on to machine learning on big data.
  16. Here is where machine learning comes to help. Machine learning is not a new field, but big data has had a huge impact on it. First of all, it revived the whole machine learning field, because more training data usually leads to better predictions. And now we have enough data to train a model with billions of parameters. Big data essentially enabled deep learning. At the same time, big data also brings a lot of challenges to machine learning, such as scalability and distributed computation. More importantly, it imposes a big learning curve on data scientists, because they need to worry not only about the particular ML algorithm, but also about how to distribute the data and computation on the big data platform.
  17. To help reduce the learning curve, many big ML systems have emerged. They are usually categorized into two camps: one camp for general machine learning, the other specialized in deep learning. But the trend now is that the two camps are starting to converge, with general ML systems starting to support deep learning, and the deep learning camp also starting to support general ML algorithms. Personally, I haven't worked much on deep learning, so I will focus on the general ML camp.
  18. Big ML systems help data scientists by masking the details of implementing ML algorithms for big data. There are different levels of abstraction that big ML systems provide. One group of big ML systems provides users with a library of machine learning algorithms. The behavior of each algorithm can be controlled by its parameters, but that's it. The algorithms are pretty much black boxes for the users; there is no way to change the internals of an algorithm. This problem is addressed by declarative ML systems, like SystemML and Mahout. These systems usually expose an R- or Matlab-like language to data scientists, with linear algebra and math operations as the primitives. The system then employs a cost-based optimizer to compile the algorithm into efficient execution plans on the target platform. Finally, H2O has recently proposed a new concept called AutoML. For a particular application, a data scientist usually tries a large number of candidate models and selects the best. AutoML basically automates this process.
  19. Next, I will briefly talk about graph databases and graph processing together.
  20. Popular graph databases include Neo4j, JanusGraph, IBM Graph, etc. They focus on real-time graph analytics. Besides updates and simple node and edge retrieval, most graph databases support graph pattern matching queries. Basically, given a graph pattern, they find subgraphs in the database that match the query. Most graph databases only support exact matching. But sometimes approximate matching is necessary when graph data is noisy. As part of my PhD work, I built a system called SAGA for approximate graph matching. It can support querying a disease pathway against a database of known pathways to find out what biological processes are affected by a disease.
  21. The second type of graph analytics systems is graph processing systems. They focus on batch graph analytics. These are long-running analyses on the entire graph, and they are often iterative. Also, a new trend in these types of systems is to deal
  23. The first solution is to integrate existing analytics systems. The two challenges here are
  24. The takeaway message of my talk is that big