SlideShare une entreprise Scribd logo
1  sur  18
Télécharger pour lire hors ligne
The UniProt SPARQL endpoint:
20 billion quads in production
© 2015 SIB
Why provide a public SPARQL endpoint
• A 10 man wet laboratory can not afford:
– to host their own database houses holding all or even a bit of all
life science data.
– not to have access, and use, existing life science information.
• Classical SQL can be provided on the web
– Is not practical
– No federation
– No standards adherence
• Document centric REST is not enough
– Swiss-Prot available as REST (over e-mail !!) since
1986
2
© 2015 SIB
3
© 2015 SIB
19,361,572,066
4
Node 1
!
64 cpu cores
256 GB ram
2.5 TB consumer SSD
Load Balancer = Apache mod_balancer
Node 2
!
64 cpu cores
256 GB ram
2.5 TB consumer SSD
© 2015 SIB
19,361,572,066
5
Node 1 Node 2
Tomcat + Sesame + UI
Virtuoso 7.2 (+)
Tomcat + Sesame + UI
Virtuoso 7.2 (+)
Load Balancer = Apache mod_balancer
© 2015 SIB
Dedicated machine for loading and testing
• Loading RDF data “solved” problem
– 500,000 triples per second easy
• that’s what our machine plus virtuoso 7.2
– and some tricks does
– 1,000,000 possible (gunzip limit on our machine)
• nquads or rdf/xml
– higher values needs parallel readers
• or even lighter weight parsers
– highest observed rate
• 2.5 million per second on 1/4 exadata
– could be pushed higher
6
© 2015 SIB
Chalenges as a public endpoint provider
• Query load unpredictable
!
• Simple data discovery queries are hard
– 1 TB+ of DB files
– e.g. from monitoring services
!
• Query timeouts not sufficient
– aim for 100% utilisation
– what can http reasonably support
– we want to be able to answer hard questions
7
100
10'000
1'000'000
2014_06
2014_07
2014_08
2014_09
2014_10
2014_11
2015_01
2015_02
queries
ask
select
construct
describe
© 2015 SIB
Queries per UniProt release
peak: 35 million per month
50 queries per second
8
© 2015 SIB
Real users
Mix between hard analytics and super specific
Estimate somewhere between:
300 - 2000 real humans per month
9
© 2015 SIB
Really hard queries
10
SELECT (COUNT(DISTINCT(?iri) AS ?iriCount))
WHERE
{
{?iri ?p ?o}
UNION
{?s ?iri ?o}
UNION
{?s ?p ?iri}
FILTER(isIRI(?iri))
}
© 2014 SIB
Counting all 3,897,109,089 IRI takes a while
• Via iSQL
– 9 to 10 hours
– if no other users
• SQL alternative
!
!
!
!
!
!
• 18 minutes
– Faster tricks?
11
> SELECT COUNT(RI_ID) FROM RDF_IRI;
count INTEGER
_____________________________________
3897109089
1 Rows. -- 1126860 msec.
© 2015 SIB
Compilation wise unlikely to be found
• Are templates a good idea?
– e.g. JVM has intrinsics
• Long.bitCount()
– in java
!
!
!
!
!
!
– or 1 X86 instruction (in use since 1.6)
» POPCNT
12
i = i - ((i >>> 1) & 0x5555555555555555L);	
i = (i & 0x3333333333333333L) 	
	 	 	 + ((i >>> 2) & 0x3333333333333333L);	
i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL;	
i = i + (i >>> 8);	
i = i + (i >>> 16);	
i = i + (i >>> 32);	
return (int)i & 0x7f;
© 2015 SIB
Template/Intrinsics based SPARQL compilation
• Recognising query template matches can be difficult
– query normalisation?
13
© 2015 SIB
Similar query
• Virtuoso
– Index only scan?
• GraphDB
– Information stored in predicate statistics that are key
for optimiser
• Can information be fetched from there?
14
SELECT (COUNT(DISTINCT(?p) AS ?pc)
WHERE {?s ?p ?pc }
© 2015 SIB
Challenges
• Virtuoso
– transitive queries
– standards compliance
!
• GraphDB
– analytical queries
– complete store scans
!
• Oracle 12c1
– configuration
– global RDF tablespace
• difficult to manage as a normal Oracle DB
15
• Public monitoring also hard
– often lower uptime than what is being monitored
– robots.txt
– not enough community support
– service description
• not being parsed
– HEAD last modified?
© 2015 SIB
Public monitoring key aid in quality assurance
16
✔sparqles
© 2015 SIB
Key-Value orientated SPARQL endpoint anyone?
• assume 400 million
named graphs
– average 50 triples
• max 5000 triples
– get the whole
named graph
• single IO
operation
17
CONSTRUCT {}
FROM uniprot:P05067
WHERE {}
© 2015 SIB
18

Contenu connexe

Tendances

Spring Data and MongoDB
Spring Data and MongoDBSpring Data and MongoDB
Spring Data and MongoDB
Oliver Gierke
 

Tendances (18)

Spring Data and MongoDB
Spring Data and MongoDBSpring Data and MongoDB
Spring Data and MongoDB
 
Big data analysing genomics and the bdg project
Big data   analysing genomics and the bdg projectBig data   analysing genomics and the bdg project
Big data analysing genomics and the bdg project
 
ISBG 2016 - XPages on IBM Bluemix
ISBG 2016 - XPages on IBM BluemixISBG 2016 - XPages on IBM Bluemix
ISBG 2016 - XPages on IBM Bluemix
 
ISAX
ISAXISAX
ISAX
 
Introducing TiDB - Percona Live Frankfurt
Introducing TiDB - Percona Live FrankfurtIntroducing TiDB - Percona Live Frankfurt
Introducing TiDB - Percona Live Frankfurt
 
GraphDb in XPages
GraphDb in XPagesGraphDb in XPages
GraphDb in XPages
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
 
Etl with apache impala by athemaster
Etl with apache impala by athemasterEtl with apache impala by athemaster
Etl with apache impala by athemaster
 
The SparkSQL things you maybe confuse
The SparkSQL things you maybe confuseThe SparkSQL things you maybe confuse
The SparkSQL things you maybe confuse
 
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with SparkSpark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
Spark Summit EU 2015: Revolutionizing Big Data in the Enterprise with Spark
 
Approaching graph db
Approaching graph dbApproaching graph db
Approaching graph db
 
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data...
 
JugMarche: Neo4j 2 (Cypher)
JugMarche: Neo4j 2 (Cypher)JugMarche: Neo4j 2 (Cypher)
JugMarche: Neo4j 2 (Cypher)
 
Spark sql under the hood - Data KRK meetup
Spark sql under the hood  - Data KRK meetupSpark sql under the hood  - Data KRK meetup
Spark sql under the hood - Data KRK meetup
 
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
 
HDF Server
HDF ServerHDF Server
HDF Server
 
Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)
Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)
Hello, Enterprise! Meet Presto. (Presto Boston Meetup 10062015)
 
Information retrieval dynamic indexing
Information retrieval dynamic indexingInformation retrieval dynamic indexing
Information retrieval dynamic indexing
 

En vedette

UniProt and the Semantic Web
UniProt and the Semantic WebUniProt and the Semantic Web
UniProt and the Semantic Web
Chimezie Ogbuji
 
The annotation of plant proteins in UniProtKB
The annotation of plant proteins in UniProtKBThe annotation of plant proteins in UniProtKB
The annotation of plant proteins in UniProtKB
EBI
 
How to submit a sequence in NCBI
How to submit a sequence in NCBIHow to submit a sequence in NCBI
How to submit a sequence in NCBI
Minhaz Ahmed
 
The human genome project
The human genome projectThe human genome project
The human genome project
14pascba
 

En vedette (7)

UniProt and the Semantic Web
UniProt and the Semantic WebUniProt and the Semantic Web
UniProt and the Semantic Web
 
The annotation of plant proteins in UniProtKB
The annotation of plant proteins in UniProtKBThe annotation of plant proteins in UniProtKB
The annotation of plant proteins in UniProtKB
 
How to submit a sequence in NCBI
How to submit a sequence in NCBIHow to submit a sequence in NCBI
How to submit a sequence in NCBI
 
Introduction to NCBI
Introduction to NCBIIntroduction to NCBI
Introduction to NCBI
 
NCBI
NCBINCBI
NCBI
 
The human genome project
The human genome projectThe human genome project
The human genome project
 
Human genome project
Human genome projectHuman genome project
Human genome project
 

Similaire à The UniProt SPARQL endpoint: 20 billion quads in production

Top 10 Tips for an Effective Postgres Deployment
Top 10 Tips for an Effective Postgres DeploymentTop 10 Tips for an Effective Postgres Deployment
Top 10 Tips for an Effective Postgres Deployment
EDB
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
ALTER WAY
 
01 demystifying mysq-lfororacledbaanddeveloperv1
01 demystifying mysq-lfororacledbaanddeveloperv101 demystifying mysq-lfororacledbaanddeveloperv1
01 demystifying mysq-lfororacledbaanddeveloperv1
Ivan Ma
 

Similaire à The UniProt SPARQL endpoint: 20 billion quads in production (20)

Top 10 Tips for an Effective Postgres Deployment
Top 10 Tips for an Effective Postgres DeploymentTop 10 Tips for an Effective Postgres Deployment
Top 10 Tips for an Effective Postgres Deployment
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
 
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
 
Building Fast Applications for Streaming Data
Building Fast Applications for Streaming DataBuilding Fast Applications for Streaming Data
Building Fast Applications for Streaming Data
 
Open Device Programmability: Hands-on Intro to RESTCONF (and a bit of NETCONF)
Open Device Programmability: Hands-on Intro to RESTCONF (and a bit of NETCONF)Open Device Programmability: Hands-on Intro to RESTCONF (and a bit of NETCONF)
Open Device Programmability: Hands-on Intro to RESTCONF (and a bit of NETCONF)
 
IMCSummit 2015 - 1 IT Business - The Evolution of Pivotal Gemfire
IMCSummit 2015 - 1 IT Business  - The Evolution of Pivotal GemfireIMCSummit 2015 - 1 IT Business  - The Evolution of Pivotal Gemfire
IMCSummit 2015 - 1 IT Business - The Evolution of Pivotal Gemfire
 
Architecting Your Own DBaaS in a Private Cloud with EM12c
Architecting Your Own DBaaS in a Private Cloud with EM12cArchitecting Your Own DBaaS in a Private Cloud with EM12c
Architecting Your Own DBaaS in a Private Cloud with EM12c
 
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
 
Unlocking big data with Hadoop + MySQL
Unlocking big data with Hadoop + MySQLUnlocking big data with Hadoop + MySQL
Unlocking big data with Hadoop + MySQL
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
 
VoltDB Big Data Camp LA 2014 - Scott Jar
VoltDB  Big Data Camp LA 2014 - Scott JarVoltDB  Big Data Camp LA 2014 - Scott Jar
VoltDB Big Data Camp LA 2014 - Scott Jar
 
TIBCO Advanced Analytics Meetup (TAAM) - June 2015
TIBCO Advanced Analytics Meetup (TAAM) - June 2015TIBCO Advanced Analytics Meetup (TAAM) - June 2015
TIBCO Advanced Analytics Meetup (TAAM) - June 2015
 
PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)
 
01 demystifying mysq-lfororacledbaanddeveloperv1
01 demystifying mysq-lfororacledbaanddeveloperv101 demystifying mysq-lfororacledbaanddeveloperv1
01 demystifying mysq-lfororacledbaanddeveloperv1
 
Oracle DB In-Memory technologie v kombinaci s procesorem M7
Oracle DB In-Memory technologie v kombinaci s procesorem M7Oracle DB In-Memory technologie v kombinaci s procesorem M7
Oracle DB In-Memory technologie v kombinaci s procesorem M7
 
NoSQL no MySQL 5.7
NoSQL no MySQL 5.7NoSQL no MySQL 5.7
NoSQL no MySQL 5.7
 
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...
 
Time Series Databases for IoT (On-premises and Azure)
Time Series Databases for IoT (On-premises and Azure)Time Series Databases for IoT (On-premises and Azure)
Time Series Databases for IoT (On-premises and Azure)
 

Plus de Jerven Bolleman (6)

Semantic Variation Graphs the case for RDF & SPARQL
Semantic Variation Graphs the case for RDF & SPARQLSemantic Variation Graphs the case for RDF & SPARQL
Semantic Variation Graphs the case for RDF & SPARQL
 
Why sparql tohu
Why sparql tohuWhy sparql tohu
Why sparql tohu
 
RDF: what and why plus a SPARQL tutorial
RDF: what and why plus a SPARQL tutorialRDF: what and why plus a SPARQL tutorial
RDF: what and why plus a SPARQL tutorial
 
Biohackathon2013: Tripling Bioinformatics Productivity
Biohackathon2013: Tripling Bioinformatics ProductivityBiohackathon2013: Tripling Bioinformatics Productivity
Biohackathon2013: Tripling Bioinformatics Productivity
 
Learning sparql 2012 12
Learning sparql 2012 12Learning sparql 2012 12
Learning sparql 2012 12
 
Uni protsparqlcloud
Uni protsparqlcloudUni protsparqlcloud
Uni protsparqlcloud
 

Dernier

+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Dernier (20)

W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
ManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide DeckManageIQ - Sprint 236 Review - Slide Deck
ManageIQ - Sprint 236 Review - Slide Deck
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Pharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodologyPharm-D Biostatistics and Research methodology
Pharm-D Biostatistics and Research methodology
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 

The UniProt SPARQL endpoint: 20 billion quads in production

  • 1. The UniProt SPARQL endpoint: 20 billion quads in production
  • 2. © 2015 SIB Why provide a public SPARQL endpoint • A 10 man wet laboratory can not afford: – to host their own database houses holding all or even a bit of all life science data. – not to have access, and use, existing life science information. • Classical SQL can be provided on the web – Is not practical – No federation – No standards adherence • Document centric REST is not enough – Swiss-Prot available as REST (over e-mail !!) since 1986 2
  • 4. © 2015 SIB 19,361,572,066 4 Node 1 ! 64 cpu cores 256 GB ram 2.5 TB consumer SSD Load Balancer = Apache mod_balancer Node 2 ! 64 cpu cores 256 GB ram 2.5 TB consumer SSD
  • 5. © 2015 SIB 19,361,572,066 5 Node 1 Node 2 Tomcat + Sesame + UI Virtuoso 7.2 (+) Tomcat + Sesame + UI Virtuoso 7.2 (+) Load Balancer = Apache mod_balancer
  • 6. © 2015 SIB Dedicated machine for loading and testing • Loading RDF data “solved” problem – 500,000 triples per second easy • that’s what our machine plus virtuoso 7.2 – and some tricks does – 1,000,000 possible (gunzip limit on our machine) • nquads or rdf/xml – higher values needs parallel readers • or even lighter weight parsers – highest observed rate • 2.5 million per second on 1/4 exadata – could be pushed higher 6
  • 7. © 2015 SIB Chalenges as a public endpoint provider • Query load unpredictable ! • Simple data discovery queries are hard – 1 TB+ of DB files – e.g. from monitoring services ! • Query timeouts not sufficient – aim for 100% utilisation – what can http reasonably support – we want to be able to answer hard questions 7
  • 9. © 2015 SIB Real users Mix between hard analytics and super specific Estimate somewhere between: 300 - 2000 real humans per month 9
  • 10. © 2015 SIB Really hard queries 10 SELECT (COUNT(DISTINCT(?iri) AS ?iriCount)) WHERE { {?iri ?p ?o} UNION {?s ?iri ?o} UNION {?s ?p ?iri} FILTER(isIRI(?iri)) }
  • 11. © 2014 SIB Counting all 3,897,109,089 IRI takes a while • Via iSQL – 9 to 10 hours – if no other users • SQL alternative ! ! ! ! ! ! • 18 minutes – Faster tricks? 11 > SELECT COUNT(RI_ID) FROM RDF_IRI; count INTEGER _____________________________________ 3897109089 1 Rows. -- 1126860 msec.
  • 12. © 2015 SIB Compilation wise unlikely to be found • Are templates a good idea? – e.g. JVM has intrinsics • Long.bitCount() – in java ! ! ! ! ! ! – or 1 X86 instruction (in use since 1.6) » POPCNT 12 i = i - ((i >>> 1) & 0x5555555555555555L); i = (i & 0x3333333333333333L) + ((i >>> 2) & 0x3333333333333333L); i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL; i = i + (i >>> 8); i = i + (i >>> 16); i = i + (i >>> 32); return (int)i & 0x7f;
  • 13. © 2015 SIB Template/Intrinsics based SPARQL compilation • Recognising query template matches can be difficult – query normalisation? 13
  • 14. © 2015 SIB Similar query • Virtuoso – Index only scan? • GraphDB – Information stored in predicate statistics that are key for optimiser • Can information be fetched from there? 14 SELECT (COUNT(DISTINCT(?p) AS ?pc) WHERE {?s ?p ?pc }
  • 15. © 2015 SIB Challenges • Virtuoso – transitive queries – standards compliance ! • GraphDB – analytical queries – complete store scans ! • Oracle 12c1 – configuration – global RDF tablespace • difficult to manage as a normal Oracle DB 15
  • 16. • Public monitoring also hard – often lower uptime than what is being monitored – robots.txt – not enough community support – service description • not being parsed – HEAD last modified? © 2015 SIB Public monitoring key aid in quality assurance 16 ✔sparqles
  • 17. © 2015 SIB Key-Value orientated SPARQL endpoint anyone? • assume 400 million named graphs – average 50 triples • max 5000 triples – get the whole named graph • single IO operation 17 CONSTRUCT {} FROM uniprot:P05067 WHERE {}