Fostering Serendipity through Big Linked Data
Muhammad Saleem, Maulik R. Kamdar, Aftab Iqbal, Shanmukha Sampath, Helena F. Deus, and Axel-Cyrille Ngonga Ngomo
Semantic Web Challenge at ISWC 2013, October 21-25, 2013, Sydney, Australia
Agenda 
• Motivation 
• Datasets 
• Architecture 
• Evaluation 
• Requirements 
• Demo 
• Conclusion and Future Work
Motivation 
Fostering Serendipity through Big Data 
Triplification, Continuous Integration, and Visualization
Triplification: Linked TCGA
• TCGA (The Cancer Genome Atlas) is a publicly accessible atlas of cancer-related data from the National Cancer Institute (NCI)
– 9000 patients
– 33 cancer types
– 147,645 raw data files
– 12.7 TB
• This is only 46% of the total expected data, and new data is submitted every day
• The goal is to enable cancer researchers to make and validate important discoveries
• Total Linked TCGA > 30 billion triples (largest dataset of LOD)
Triplification: PubMed
• Collection of publications from the biomedical domain
• Large amount of metadata (MeSH terms)
• 23+ million publications
• 10,000 new publications/month
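As an illustration of what triplifying publication metadata can look like, here is a minimal sketch that turns one metadata record into N-Triples lines. It is not the RDFizer used by the authors: the record layout, the `record_to_ntriples` helper, and the `http://example.org/pubmed/` base URI are assumptions made for this example; only the Dublin Core `title` and `subject` properties are real vocabulary terms.

```python
def literal(value):
    """Quote a string as an N-Triples literal, escaping the characters that must be escaped."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def record_to_ntriples(record, base="http://example.org/pubmed/"):
    """Emit one title triple and one subject triple per MeSH term.

    `record` is a hypothetical dict with the keys "pmid", "title" and "mesh_terms".
    """
    subject = f"<{base}{record['pmid']}>"
    lines = [f"{subject} <http://purl.org/dc/terms/title> {literal(record['title'])} ."]
    for term in record["mesh_terms"]:
        lines.append(f"{subject} <http://purl.org/dc/terms/subject> {literal(term)} .")
    return lines


if __name__ == "__main__":
    # Made-up record, used only to show the output shape.
    example = {"pmid": "0000000", "title": "An example title", "mesh_terms": ["Neoplasms"]}
    for line in record_to_ntriples(example):
        print(line)
```

Writing one triple per MeSH term keeps each term individually queryable from a SPARQL endpoint, which is the property the federation layer relies on.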
Big Data Continuous Integration
[Architecture diagram. Labels: SPARQL Query, Results; TopFed (Parser, Federator, Optimizer, Integrator, Index); Sub-query; three SPARQL endpoints, each over RDF; PubMed, Entrez Utilities, RDFizer; TCGA Data Portal, Auto Loader.]
[Diagram: tree over the tumour SPARQL endpoints, labelled "Highly Scalable". Leaf endpoints and the tumour ranges they cover, in slide order:]
– C-1 ∨ Category Colour = blue (CNV, SNP, E-Gene, miRNA, E-Protein, Clinical): b1 1-16, b2 17-33
– C-2 ∨ Category Colour = pink (Exon-Expression): p1 1-5, p2 6-11, p3 12-16, p4 17-22, p5 23-27, p6 28-33
– C-3 ∨ Category Colour = green (Methylation): g1 1-4, g2 5-8, g3 9-12, g4 13-16, g5 17-20, g6 21-24, g7 25-27, g8 28-30, g9 31-33
A = {chromosome, result, bcr_patient_barcode}
G = {start, stop}
C = {CNV, SNP, E-Gene, E-Protein, miRNA, Clinical}
D = {seg_mean, rpmmm, scaled_est, p_exp_val}
E = {RPKM}
F = {Expression-Exon}
M = {beta_value, position}
B = {DNA-Methylation}
For each query triple t(s, p, o) ∈ T:
C-1 = (p ∈ D ∪ A ∪ G ∨ (p = rdf:type ∧ o ∈ C)) ∧ ((S-Join(p, D ∪ C) ∨ P-Join(p, D ∪ C)) ∨ (¬S-Join(p, M ∪ B ∪ E ∪ F) ∧ ¬P-Join(p, M ∪ B ∪ E ∪ F)))
C-2 = (p ∈ E ∪ A ∪ G ∨ (p = rdf:type ∧ o ∈ F)) ∧ ((S-Join(p, E ∪ F) ∨ P-Join(p, E ∪ F)) ∨ (¬S-Join(p, M ∪ B ∪ D ∪ C) ∧ ¬P-Join(p, M ∪ B ∪ D ∪ C)))
C-3 = (p ∈ M ∪ A ∨ (p = rdf:type ∧ o ∈ B)) ∧ ((S-Join(p, M ∪ B) ∨ P-Join(p, M ∪ B)) ∨ (¬S-Join(p, E ∪ F ∪ D ∪ C) ∧ ¬P-Join(p, E ∪ F ∪ D ∪ C)))
IF tumour lookup is successful
– forward to the corresponding leaf
ELSE
– broadcast to every one
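The category conditions C-1 to C-3 and the tumour lookup on this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not the TopFed implementation: the slide does not define S-Join and P-Join, so the code assumes they mean a star join (shared subject) and a path join (subject of one pattern is the object of another); the helper names (`marker`, `in_category`, `categories`, `route`) are made up; the pairing of endpoint labels with tumour ranges follows the order in which they appear on the slide; and "broadcast to every one" is read as "all leaves of the chosen category".

```python
from typing import NamedTuple

RDF_TYPE = "rdf:type"

# Predicate and class sets copied from the slide.
A = {"chromosome", "result", "bcr_patient_barcode"}
G = {"start", "stop"}
C = {"CNV", "SNP", "E-Gene", "E-Protein", "miRNA", "Clinical"}
D = {"seg_mean", "rpmmm", "scaled_est", "p_exp_val"}
E = {"RPKM"}
F = {"Expression-Exon"}
M = {"beta_value", "position"}
B = {"DNA-Methylation"}


class Triple(NamedTuple):
    s: str
    p: str
    o: str


def marker(t):
    """Term that hints at a pattern's category: the class for rdf:type, else the predicate."""
    return t.o if t.p == RDF_TYPE else t.p


def s_join(t, query, terms):
    """Assumed star join: another pattern shares t's subject and has a marker in `terms`."""
    return any(u != t and u.s == t.s and marker(u) in terms for u in query)


def p_join(t, query, terms):
    """Assumed path join: another pattern is chained to t via subject/object, marker in `terms`."""
    return any(u != t and (u.o == t.s or u.s == t.o) and marker(u) in terms for u in query)


def in_category(t, query, preds, classes, own, other):
    """One C-i condition: eligible AND (joins own category OR joins no other category)."""
    eligible = t.p in preds or (t.p == RDF_TYPE and t.o in classes)
    joins_own = s_join(t, query, own) or p_join(t, query, own)
    joins_other = s_join(t, query, other) or p_join(t, query, other)
    return eligible and (joins_own or not joins_other)


def categories(t, query):
    """Return the set of categories whose endpoints may answer triple pattern t."""
    result = set()
    if in_category(t, query, D | A | G, C, D | C, M | B | E | F):
        result.add("C-1")
    if in_category(t, query, E | A | G, F, E | F, M | B | D | C):
        result.add("C-2")
    if in_category(t, query, M | A, B, M | B, E | F | D | C):
        result.add("C-3")
    return result


# (first tumour, last tumour, endpoint label), in slide order.
LEAVES = {
    "C-1": [(1, 16, "b1"), (17, 33, "b2")],
    "C-2": [(1, 5, "p1"), (6, 11, "p2"), (12, 16, "p3"),
            (17, 22, "p4"), (23, 27, "p5"), (28, 33, "p6")],
    "C-3": [(1, 4, "g1"), (5, 8, "g2"), (9, 12, "g3"), (13, 16, "g4"), (17, 20, "g5"),
            (21, 24, "g6"), (25, 27, "g7"), (28, 30, "g8"), (31, 33, "g9")],
}


def route(category, tumour=None):
    """Forward to one leaf if the tumour number is known, else broadcast within the category."""
    leaves = LEAVES[category]
    if tumour is not None:
        for first, last, name in leaves:
            if first <= tumour <= last:
                return [name]
    return [name for _, _, name in leaves]
```

With these definitions, a `chromosome` pattern that shares its subject with an `rdf:type CNV` pattern is assigned to C-1 only, while the same pattern on its own stays eligible for all three categories, because `chromosome` belongs to the shared set A and nothing in the query disambiguates it.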
Evaluation: Number of Sub-Query Submissions
[Chart: number of sub-queries submitted by FedX and by TopFed for queries 1-10 and their average; y-axis from 0 to 60.]
• TopFed submits about 1/3 as many sub-queries as FedX
• Number of ASK requests
– FedX 480
– TopFed 10
Evaluation: Query Runtime
[Chart: query execution time (msec, log scale from 1 to 100000) of FedX and TopFed for queries 1-10 and their average.]
• TopFed outperforms FedX significantly on 90% of the queries
• On average, the query runtime of TopFed is about 1/3 of that of FedX
• TopFed's best runtimes (query 2, query 3) are more than 75 times smaller than those of FedX
Big Data Track Requirements 
• Data Volume 
– 7.36 billion triples from Linked TCGA 
– 23 million publications from PubMed 
• Data Variety 
– The Linked TCGA data was extracted from raw text files with different structures
– The metadata associated with PubMed publications was processed and transformed into RDF
– Unstructured data (publication abstracts) is processed to extract mentions of gene names and cancers
• Data Velocity
– TCGA data doubles every 2 months
– PubMed grows by 10k publications/month
Big Data Visualization
Tumor-wise Visualization
PubMed Paper-wise Visualization
Genome-wise Patients Results Visualization
Everything is Public 
• Demo: http://srvgal78.deri.ie/tcga-pubmed/ 
• TopFed: https://code.google.com/p/topfed/ 
• TCGA Data Refiner, RDFizer: http://goo.gl/vSnBEJ 
• Utilities: http://goo.gl/kNrFdI 
• Linked TCGA : http://tcga.deri.ie/ 
saleem@informatik.uni-leipzig.de 
AKSW, University of Leipzig, Germany
