Trade-offs in Automatic
Provenance Capture
Manolis Stamatogiannakis, Hasanat Kazmi,
Hashim Sharif, Remco Vermeulen,
Ashish Gehani, Herbert Bos, and Paul Groth
Capturing Provenance
Disclosed Provenance
+ Accuracy
+ High-level semantics
– Intrusive
– Manual Effort
Observed Provenance
– False positives
– Semantic Gap
+ Non-intrusive
+ Minimal manual effort
CPL (Macko ‘12)
Trio (Widom ‘09)
PrIME (Miles ‘09)
Taverna (Oinn ‘06)
VisTrails (Freire ‘06)
ES3 (Frew ‘08)
Trec (Vahdat ‘98)
PASSv2 (Holland ‘08)
DTrace Tool (Gessiou ‘12)
2
OPUS (Balakrishnan ‘13)
https://github.com/ashish-gehani/SPADE/wiki
• Strace Reporter
– Programs run under strace. The resulting log is parsed
to extract provenance.
• LLVMTrace
– Instrumentation added to function boundaries at
compile time.
• DataTracker
– Dynamic Taint Analysis. Bytes are associated with
metadata, which is propagated as the program
executes.
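As a rough illustration of the log-parsing idea behind a system-call-based reporter, the sketch below turns a few strace-style lines into provenance edges. It is not SPADE's Strace Reporter: the sample log, the regular expressions, the `extract_edges` name, the `proc` label, and the OPM-style edge names are all assumptions made for this example, and real strace output has many more cases (PIDs, failed calls, interrupted calls).

```python
import re

# Hypothetical sample of strace-style output (raw string keeps "\n" literal).
SAMPLE_LOG = r'''
openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3
read(3, "127.0.0.1 localhost\n", 4096) = 20
openat(AT_FDCWD, "/tmp/out.txt", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
write(4, "127.0.0.1 localhost\n", 20) = 20
close(3) = 0
close(4) = 0
'''

# Successful openat calls: capture the path and the returned file descriptor.
OPEN_RE = re.compile(r'^openat\(AT_FDCWD, "([^"]+)", [^)]*\)\s+= (\d+)$')
# Successful read/write calls: capture the call name and the file descriptor.
IO_RE = re.compile(r'^(read|write)\((\d+), .*\)\s+= (\d+)$')


def extract_edges(log_lines, process="proc"):
    """Return a set of (source, relation, target) provenance edges."""
    fd_to_path = {}
    edges = set()
    for line in log_lines:
        line = line.strip()
        opened = OPEN_RE.match(line)
        if opened:
            fd_to_path[int(opened.group(2))] = opened.group(1)
            continue
        io_call = IO_RE.match(line)
        if io_call and int(io_call.group(2)) in fd_to_path:
            path = fd_to_path[int(io_call.group(2))]
            if io_call.group(1) == "read":
                edges.add((process, "used", path))
            else:
                edges.add((path, "wasGeneratedBy", process))
    return edges


edges = extract_edges(SAMPLE_LOG.splitlines())
```

Note that at this level every file written is assumed to depend on every file read by the same process, which is where the false positives of observed provenance come from.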
3
SPADEv2 – Provenance
Collection
SPADEv2 flow
4
Current Intuition
5
Current Intuition
6
Incomplete Picture
• Faster, but how much?
• What is the performance “price” for fewer
false positives?
• Is a compile-time solution worth the
effort?
7
How can one get more
insight?
Run a benchmark!
8
Which one?
• LMBench, UnixBench, Postmark, BLAST,
SPECint…
• [Traeger 08]: “Most popular benchmarks
are flawed.”
• No matter what you choose, there will be
blind spots.
9
Start simple: UnixBench
• Well understood sub-benchmarks.
• Emphasizes the performance of system calls.
• System calls are commonly used for the
extraction of provenance.
• More insight on which collection backend
would suit specific applications.
• We’ll have a performance baseline to
improve the specific implementations.
10
UnixBench Results
11
TRADEOFFS
12
Performance vs. Integration
Effort
• Capturing provenance from completely
unmodified programs may degrade
performance.
• Modification of either the source
(LLVMTrace) or the platform (LPM, Hi-Fi)
should be considered for a production
deployment.
13
Performance vs. Provenance
Granularity
• We couldn’t verify this intuition for the case
of Strace Reporter compared to
LLVMTrace.
– Strace reporter implementation is not optimal.
• Tracking fine-grained provenance may
interfere with existing optimizations.
– E.g. buffering I/O does not benefit
DataTracker.
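One way to see why buffering may not help byte-level tracking is a toy cost model. The sketch below is an assumption-laden illustration, not DataTracker's implementation and not a measurement: `read_all` and its tag format are hypothetical. It counts how many read calls a program makes versus how many bytes a taint tracker would have to tag.

```python
import io


def read_all(stream, chunk_size):
    """Read a stream in chunks and tag every byte with its origin.

    Returns (read_calls, bytes_tagged). The tag list stands in for the
    shadow memory a taint tracker keeps alongside the program's data.
    """
    read_calls = 0
    tags = []
    offset = 0
    while True:
        chunk = stream.read(chunk_size)
        read_calls += 1  # cost seen by a system-call tracer
        if not chunk:
            break
        for _ in chunk:
            tags.append(("input", offset))  # cost seen by a taint tracker
            offset += 1
    return read_calls, len(tags)


data = bytes(4096)
unbuffered = read_all(io.BytesIO(data), 1)     # one byte per read
buffered = read_all(io.BytesIO(data), 1024)    # 1 KiB per read
```

In this model the larger buffer cuts the number of read calls from 4097 to 5, while the number of tagged bytes stays at 4096 in both cases.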
14
Performance vs.
False Positives/Analysis Scope
• “Brute-forcing” a low false-positive ratio with the
“track everything” approach of DataTracker is
prohibitively expensive.
• Limiting the analysis scope gives a performance
boost.
• If we exploit known semantics, we can have the
best of both worlds.
– Pre-existing semantic knowledge: LLVMTrace
– Dynamically acquired knowledge: ProTracer [Ma
2016]
15
TAKEAWAYS
16
Takeaway: System Event
Tracing
• A good start for quick deployments
• Simple versions may be expensive
• What happens in the binary?
17
Takeaway: Compile-time
Instrumentation
• Middle-ground between disclosed and
automatic provenance collection.
• But you need access to the source code.
18
Takeaway: Taint Analysis
• Prohibitively expensive
for computation-
intensive programs.
• Likely to remain so,
even after optimizations.
• Reserved for
provenance analysis of
unknown/legacy
software.
• Offline approach
(Stamatogiannakis
TAPP’15)
19
Generalizing the Results
• Only one implementation
was tested for each method.
• Repeating the tests with
alternative implementations
will provide confidence in
the insights gained.
• More confidence when
choosing a specific collection
method.
20
Different methods
Different implementations
Implementation Details Matter
• Our results are influenced by the specifics
of the implementation.
• Anecdote: The initial implementation of
LLVMTrace was actually slower than
strace reporter.
21
Provenance Quality
• Qualitative features of the
provenance are also very
important.
• How many vertices/edges are
contained in the generated
provenance graph?
• Precision/Recall based on
provenance ground truth.
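A minimal sketch of the precision/recall idea, assuming provenance graphs are compared as sets of edges. The edge tuples, file names, and the `precision_recall` helper are hypothetical and only illustrate the metric.

```python
def precision_recall(captured, ground_truth):
    """Compare captured provenance edges against ground-truth edges."""
    captured, ground_truth = set(captured), set(ground_truth)
    true_positives = len(captured & ground_truth)
    precision = true_positives / len(captured) if captured else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical case: out.txt truly depends only on a.txt, but a
# coarse-grained tracker also links it to b.txt (a false positive).
ground_truth = {("out.txt", "a.txt")}
captured = {("out.txt", "a.txt"), ("out.txt", "b.txt")}
precision, recall = precision_recall(captured, ground_truth)
```

Here the tracker finds every true dependency (recall 1.0), but only half of the edges it reports are correct (precision 0.5).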
22
Performance Benchmarks
Qualitative Benchmarks
Where to go next?
• UnixBench is a basic benchmark.
• SPEC: Comprehensive in terms of
performance evaluation.
– Hard to get the provenance ground truth needed to
assess the quality of captured provenance.
• Better directions:
– Coreutils based micro-benchmarks.
– Macro-benchmarks (e.g. Postmark,
compilation benchmarks).
23
Conclusion
• Automatic provenance capture is an
important part of the ecosystem
• Trade-offs in different capture modes
• Benchmarking – to inform
• Common platforms are essential
24
The End
25
UnixBench Results
26

 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 

Recently uploaded (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

Tradeoffs in Automatic Provenance Capture

  • 1. Trade-offs in Automatic Provenance Capture Manolis Stamatogiannakis, Hasanat Kazmi, Hashim Sharif, Remco Vermeulen, Ashish Gehani, Herbert Bos, and Paul Groth
  • 2. Capturing Provenance Disclosed Provenance + Accuracy + High-level semantics – Intrusive – Manual Effort Observed Provenance – False positives – Semantic Gap + Non-intrusive + Minimal manual effort CPL (Macko ‘12) Trio (Widom ‘09) PrIME (Miles ‘09) Taverna (Oinn ‘06) VisTrails (Freire ‘06) ES3 (Frew ‘08) Trec (Vahdat ‘98) PASSv2 (Holland ‘08) DTrace Tool (Gessiou ‘12) OPUS (Balakrishnan ‘13)
  • 3. SPADEv2 – Provenance Collection (https://github.com/ashish-gehani/SPADE/wiki) • Strace Reporter – Programs run under strace. The produced log is parsed to extract provenance. • LLVMTrace – Instrumentation is added at function boundaries at compile time. • DataTracker – Dynamic Taint Analysis. Bytes are associated with metadata, which is propagated as the program executes.
  • 7. Incomplete Picture • Faster, but by how much? • What is the performance “price” for fewer false positives? • Is a compile-time solution worth the effort?
  • 8. How can one get more insight? Run a benchmark!
  • 9. Which one? • LMBench, UnixBench, Postmark, BLAST, SPECint… • [Traeger 08]: “Most popular benchmarks are flawed.” • No matter what you choose, there will be blind spots.
  • 10. Start simple: UnixBench • Well-understood sub-benchmarks. • Emphasizes the performance of system calls. • System calls are commonly used for the extraction of provenance. • More insight into which collection backend would suit specific applications. • We’ll have a performance baseline for improving the specific implementations.
  • 13. Performance vs. Integration Effort • Capturing provenance from completely unmodified programs may degrade performance. • Modification of either the source (LLVMTrace) or the platform (LPM, Hi-Fi) should be considered for a production deployment.
  • 14. Performance vs. Provenance Granularity • We couldn’t verify this intuition for the case of strace reporter compared to LLVMTrace. – The strace reporter implementation is not optimal. • Tracking fine-grained provenance may interfere with existing optimizations. – E.g. buffering I/O does not benefit DataTracker.
  • 15. Performance vs. False Positives/Analysis Scope • “Brute-forcing” a low false-positive ratio with the “track everything” approach of DataTracker is prohibitively expensive. • Limiting the analysis scope gives a performance boost. • If we exploit known semantics, we can have the best of both worlds. – Pre-existing semantic knowledge: LLVMTrace – Dynamically acquired knowledge: ProTracer [Ma 2016]
  • 17. Takeaway: System Event Tracing • A good start for quick deployments. • Simple versions may be expensive. • What happens in the binary?
  • 18. Takeaway: Compile-time Instrumentation • Middle ground between disclosed and automatic provenance collection. • But you need access to the source code.
  • 19. Takeaway: Taint Analysis • Prohibitively expensive for computation-intensive programs. • Likely to remain so, even after optimizations. • Reserved for provenance analysis of unknown/legacy software. • Offline approach (Stamatogiannakis TAPP’15)
  • 20. Generalizing the Results • Only one implementation was tested for each method. • Repeating the tests with alternative implementations will provide confidence in the insights gained. • More confidence when choosing a specific collection method. [Figure labels: Different methods, Different implementations]
  • 21. Implementation Details Matter • Our results are influenced by the specifics of the implementation. • Anecdote: The initial implementation of LLVMTrace was actually slower than strace reporter.
  • 22. Provenance Quality • Qualitative features of the provenance are also very important. • How many vertices/edges are contained in the generated provenance graph? • Precision/Recall based on provenance ground truth. [Figure labels: Performance Benchmarks, Qualitative Benchmarks]
  • 23. Where to go next? • UnixBench is a basic benchmark. • SPEC: Comprehensive in terms of performance evaluation. – Hard to get the provenance ground truth needed to assess the quality of captured provenance. • Better directions: – Coreutils-based micro-benchmarks. – Macro-benchmarks (e.g. Postmark, compilation benchmarks).
  • 24. Conclusion • Automatic provenance capture is an important part of the ecosystem. • Trade-offs exist between different capture modes. • Benchmarking helps inform the choice. • Common platforms are essential.
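
The quality measures named on slide 22 (vertex/edge counts and precision/recall against a provenance ground truth) can be illustrated with a minimal sketch. It assumes a provenance graph is represented as a set of (source, relation, destination) edge triples. That representation, the function name `precision_recall`, and the sample edges are hypothetical and chosen only for illustration; they are not taken from SPADE or from the talk.

```python
from typing import Iterable, Set, Tuple

# Assumed representation: one provenance edge is (source, relation, destination).
Edge = Tuple[str, str, str]


def precision_recall(captured: Iterable[Edge],
                     ground_truth: Iterable[Edge]) -> Tuple[float, float]:
    """Compare captured provenance edges against a ground-truth edge set."""
    captured_set: Set[Edge] = set(captured)
    truth_set: Set[Edge] = set(ground_truth)
    true_positives = len(captured_set & truth_set)
    # Precision: fraction of captured edges that are real dependencies.
    precision = true_positives / len(captured_set) if captured_set else 0.0
    # Recall: fraction of real dependencies that were captured.
    recall = true_positives / len(truth_set) if truth_set else 0.0
    return precision, recall


# Hypothetical ground truth for copying a.txt to b.txt with cp.
ground_truth = [
    ("cp", "used", "a.txt"),
    ("b.txt", "wasGeneratedBy", "cp"),
]
# Hypothetical capture that also records library loads as dependencies
# (false positives with respect to this ground truth).
captured = ground_truth + [
    ("cp", "used", "/etc/ld.so.cache"),
    ("cp", "used", "libc.so"),
]

precision, recall = precision_recall(captured, ground_truth)
edge_count = len(set(captured))
print(f"edges={edge_count} precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case every true dependency is captured, so recall is perfect, while the extra edges lower precision. This is the kind of false-positive cost the slides attribute to observed provenance.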

Editor's Notes

  1. What should I recommend? It depends?
  2. Questions that cannot be answered by intuition alone.
  3. Traeger focused on using benchmarks to measure filesystem/storage performance. His observation largely holds for benchmarks that measure other types of performance.
  4. SPEC includes several sub-benchmarks which may be atypical for provenance analysis, e.g. a discrete event simulator or a quantum computer simulator. If we also want to measure precision/recall, it is hard to get the ground truth for the provenance generated by these benchmarks.
  5. UnixBench sub-benchmarks:
     1. execl-xput: How fast the current process image can be replaced with a new one, as a result of an execve system call.
     2. fcopy-256, fcopy-1024, fcopy-4096: Speed of a file-to-file copy using different buffer sizes.
     3. pipe-xput, pipe-cs: Speed of communication over pipes. In the first test, the reads and writes on the pipe happen from a single process. In the second test a second process is spawned, so the communication also includes a context switch between the two.
     4. spawn-xput: A simple fork-wait loop to measure how much time is needed to create and then destroy a process.
     5. shell-1, shell-8: Execution speed for the processing of a data file. The processing is implemented using common Unix utilities, wrapped in a shell script. The two tests differ in the number of concurrently executing scripts.
     6. syscall: System call overhead. The test uses getpid to measure this. This system call is chosen because it requires minimal in-kernel processing, so its main overhead comes from the switch between kernel and user mode.
  6. Degradation depends on method used.
  7. LPM [Bates 15] already supports the SPADE DSL reporter.
  8. Reason for being slower: Lack of buffering. A new connection was opened each time we needed to output a piece of provenance.
  9. SPADEv2 provides easy interfacing with other provenance systems via the DSL Reporter. Linux Provenance Modules [Bates 15] already support it. This makes it a good platform for measuring qualitative features (such as the number of edges/vertices) and also for running queries that would verify whether the ground truth was captured.
  10. UnixBench sub-benchmarks in brief:
      • Execl: execve speed.
      • Fcopy-*: file copy with different buffer sizes.
      • Pipe-*: pipe communication.
      • Spawn: fork-wait loop (process creation/destruction speed).
      • Shell-*: Unix utilities wrapped in a script. Similar to what coreutils testing would yield.
      • Syscall: system call overhead (uses getpid as the most “lightweight” system call).
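
The syscall and spawn sub-benchmarks described in the notes above can be approximated with a short sketch. This is not UnixBench code: it is an illustrative Python version, the helper names `time_per_call` and `spawn_once` are hypothetical, and Python interpreter overhead will dominate the absolute numbers. It is meant only to show the shape of the measurement (a tight getpid loop, and a fork-wait loop) and needs a Unix-like system because it relies on `os.fork`.

```python
import os
import time


def time_per_call(func, iterations=100_000):
    """Return the average wall-clock seconds taken by one call to func."""
    start = time.perf_counter()
    for _ in range(iterations):
        func()
    return (time.perf_counter() - start) / iterations


def spawn_once():
    """One fork-wait cycle: create a child process and reap it."""
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately without running cleanup handlers.
        os._exit(0)
    # Parent: wait for the child so no zombie process is left behind.
    os.waitpid(pid, 0)


if __name__ == "__main__":
    # getpid needs minimal in-kernel work, so it approximates pure
    # user/kernel switch cost (plus interpreter overhead here).
    print(f"getpid:    {time_per_call(os.getpid) * 1e9:.0f} ns per call")
    print(f"fork+wait: {time_per_call(spawn_once, 200) * 1e6:.0f} us per cycle")
```

Running the same script once natively and once under a capture backend (for example under strace) and comparing the two sets of timings gives a rough sense of the per-event overhead that each sub-benchmark is designed to expose.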