C O M P U T E | S T O R E | A N A L Y Z E
Challenges and Patterns for
Semantics at Scale
Rob Vesse
rvesse@cray.com
@RobVesse
Overview
● Background
● Challenges & Patterns
● Obtaining Data
● Input Format
● Blank Nodes
● Graph Partitioning
● Benchmarking
Background
● PhD in Computer Science
● Open Source
● Apache Jena
● dotNetRDF
● Software Engineer at Cray Inc
● In Analytics R&D
● Last 5 years
● Cray sells a range of analytics products
● Cray Graph Engine
● Massively scalable parallel RDF database and SPARQL engine
● Runs on GX and XC hardware platforms
● GX nodes are roughly equivalent to an r3.8xlarge EC2 instance
Background - Terminology
● What do we mean by at scale?
● Typical customers have tens of billions of triples
● Some are around the 100 billion mark
● What do we mean by parallelism?
● On node i.e. multiple threads/processes
● Across nodes i.e. multiple machines
Challenge #1 - Obtaining Data
● Most data does not start out as RDF
● Relational databases, spreadsheets, structured/semi-structured
data, flat files etc.
● It varies depending on customer domain
● Therefore the first challenge is to get the data into RDF
● Problems
● Many ETL tools don't support RDF as an output format
● Even when tools do support it, they are not scalable
● E.g. D2RQ (http://d2rq.org)
Pattern #1 - Leverage Big Data
● Lots of big data projects can be used to implement ETL
pipelines
● E.g. MapReduce, Spark, Flume, Sqoop
● There are some libraries available that provide basic
plumbing for this e.g.
● Apache Jena Elephas
● http://jena.apache.org/documentation/hadoop/index.html
● Unfortunately ETL tends to be very customer and data
specific
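
As a rough illustration of the kind of per-record transform such a pipeline ends up running, here is a minimal Python sketch that turns CSV rows into N-Triples lines. The function names, the base IRI layout (row/ and column/) and the choice of one key column are all assumptions made for this example; it also assumes keys and column names are already safe to embed in an IRI. Real mappings are customer and data specific, as noted above.

```python
import csv
import io


def escape_literal(value):
    """Escape a string so it can sit inside an N-Triples "..." literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def row_to_ntriples(row, key_column, base):
    """Turn one CSV row (a dict) into a list of N-Triples lines.

    Hypothetical mapping: the key column names the subject, every other
    non-empty column becomes a predicate with a plain string literal.
    """
    subject = f"<{base}row/{row[key_column]}>"
    lines = []
    for column, value in row.items():
        if column == key_column or value == "":
            continue
        predicate = f"<{base}column/{column}>"
        lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


def csv_to_ntriples(text, key_column, base):
    """Yield N-Triples lines for every row of a CSV document."""
    for row in csv.DictReader(io.StringIO(text)):
        yield from row_to_ntriples(row, key_column, base)
```

Because each row is converted independently, the same function could be called from a map step in MapReduce or Spark without shared state.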
Challenge #2 - Input Format
● What data format should we be using?
● There are at least four widely used standard
serialisations:
● N-Triples/N-Quads, Turtle/TriG, RDF/XML and JSON-LD
● Plus a variety of less widely used formats e.g. TriX, RDF/JSON,
HDT, RDF/Thrift, Sesame Binary RDF etc.
● Choice of format affects how you process it
● Parallel processing
● Error Tolerance
● State Tracking
Pattern #2 - Use N-Triples/N-Quads
● Simple but effective
● Can be arbitrarily split into chunks
● E.g. pick a chunk size in bytes, split the file at those offsets,
seek from each chunk boundary to the next actual line boundary,
then process line by line
● Extremely error tolerant
● Every line can be processed independently without needing
any shared state
● Even this has challenges:
● Verbose format so large datasets require extremely large files
● Blank nodes can still be problematic
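
The chunking approach described above can be sketched in Python as follows. This is a minimal sketch, not any particular library's implementation: the function names are made up for the example, and it assumes a UTF-8 file with newline-terminated lines. The ownership rule it uses is that a line belongs to the chunk in which its first byte falls, so every line is read exactly once no matter where the byte boundaries land.

```python
import os


def chunk_ranges(path, chunk_size):
    """Return (start, end) byte ranges that together cover the file."""
    size = os.path.getsize(path)
    return [(start, min(start + chunk_size, size))
            for start in range(0, size, chunk_size)]


def read_chunk_lines(path, start, end):
    """Yield the complete lines owned by the byte range [start, end)."""
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte and read up to the next newline. If the
            # previous byte is a newline we stay at `start`; otherwise we
            # skip the tail of a line that the previous chunk owns.
            f.seek(start - 1)
            f.readline()
        # Read every line whose first byte lies before `end`, even if the
        # line itself runs past the end of the chunk.
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            yield line.decode("utf-8").rstrip("\r\n")
```

Each (start, end) pair can be handed to a different thread, process or machine; no worker needs to know what the others have read.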
Challenge #3 - Blank Node Identifiers
● Specifications say that a blank node identifier is file scoped
● I.e. _:foo in a.nt is a different node from _:foo in b.nt
● And _:foo is the same node throughout a.nt
● Need to consistently assign identifiers despite processing the
data in chunks on different physical nodes
● Preferably without resorting to global state/synchronisation

a.nt:
<urn:a> <urn:link> _:foo .
_:foo <urn:link> <urn:b> .
# Many 100,000s of lines later
<urn:z> <urn:link> _:foo .

b.nt:
_:foo <urn:value> "example" .
_:bar <urn:value> "other" .
Pattern #3 - Derived Blank Node Identifiers
● Derive identifiers from a combination of their local
identifier and a scope identifier
● E.g. _:foo and a.nt
● Derivation method doesn't matter provided it is:
● Scope aware
● Deterministic
● Some possibilities:
● One-way hash e.g. MD5
● Mathematical transform
● Seeded random number generator (RNG)
● Apache Jena uses a seeded RNG
● Scope awareness is achieved by seeding the RNG based upon
the filename
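
To make the idea concrete, here is a minimal Python sketch of the one-way hash option from the list above. It is not how Apache Jena does it (the slides say Jena uses a seeded RNG); the function name, the NUL separator and the "b" prefix are assumptions chosen for this example.

```python
import hashlib


def derive_bnode_id(scope, label):
    """Map a (scope, local label) pair to a stable blank node identifier.

    scope: something that identifies the source file, e.g. "a.nt"
    label: the local blank node label without the "_:" prefix, e.g. "foo"
    """
    # The NUL separator keeps ("a", "bc") distinct from ("ab", "c").
    digest = hashlib.md5(f"{scope}\0{label}".encode("utf-8")).hexdigest()
    # Prefix with a letter so the result is a plain alphanumeric label.
    return f"_:b{digest}"
```

Any worker that sees _:foo while processing a chunk of a.nt computes the same identifier without talking to other workers, while _:foo from b.nt maps to a different one.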
Challenge #4 - Graph Partitioning
● Open Problem
● NP-hard
● Large graphs are never going to be processable on a
single node
● Need to partition across multiple nodes
● Partitioning affects both storage and processing of a
graph
● May need different schemes depending on desired processing
Pattern #4 - Domain Specific/Avoid It!
● For specific workloads a domain specific partitioning will
be best
● Needs knowledge of data and workload
● E.g. Educating the Planet with Pearson
● If you can then avoid it!
● Take advantage of increasingly capable hardware
● Large memory sizes, non-volatile memory, RDMA, high speed
interconnects, SSDs
Challenge #5 - Benchmarking
● Many of the classic benchmarks were developed by
academics
● E.g. LUBM, SP2B
● Often aren’t representative of actual customer problems
● Many data generators are single threaded
● Difficult to generate large-scale datasets
Pattern #5 - Change Benchmarks
● Linked Data Benchmark Council (LDBC)
● Industry working group that develops standardised benchmarks
● Equivalent to the Transaction Processing Performance Council
(TPC) in the relational database industry
● http://ldbcouncil.org
● Design your own
● https://github.com/rvesse/sparql-query-bm
● Improve an existing one
● https://github.com/rvesse/lubm-uba
● LUBM 8k (~1 billion triples) can be generated in under 7
minutes, which is a 10x speed-up
Questions?
rvesse@cray.com
@RobVesse

Challenges and patterns for semantics at scale

  • 1. C O M P U T E | S T O R E | A N A L Y Z E Challenges and Patterns for Semantics at Scale Rob Vesse rvesse@cray.com @RobVesse
  • 2. Overview
    ● Background
    ● Challenges & Patterns
    ● Obtaining Data
    ● Input Format
    ● Blank Nodes
    ● Graph Partitioning
    ● Benchmarking
  • 3. Background
    ● PhD in Computer Science
    ● Open Source
    ● Apache Jena
    ● dotNetRDF
    ● Software Engineer at Cray Inc
    ● In Analytics R&D
    ● Last 5 years
    ● Cray sells a range of analytics products
    ● Cray Graph Engine
    ● Massively scalable parallel RDF database and SPARQL engine
    ● Runs on GX and XC hardware platforms
    ● GX nodes are roughly equivalent to an r3.8xlarge EC2 instance
  • 4. Background - Terminology
    ● What do we mean by at scale?
    ● Typical customers have 10s of billions of triples
    ● Some are around the 100 billion mark
    ● What do we mean by parallelism?
    ● On node i.e. multiple threads/processes
    ● Across nodes i.e. multiple machines
  • 5. Challenge #1 - Obtaining Data
    ● Most data does not start out as RDF
    ● Relational databases, spreadsheets, structured/semi-structured data, flat files etc.
    ● It varies depending on customer domain
    ● Therefore the first challenge is to get the data into RDF
    ● Problems
    ● Many ETL tools don't support RDF as an output format
    ● Even if tools do support it they are not scalable
    ● E.g. D2RQ (http://d2rq.org)
  • 6. Pattern #1 - Leverage Big Data
    ● Lots of big data projects can be used to implement ETL pipelines
    ● E.g. MapReduce, Spark, Flume, Sqoop
    ● There are some libraries available that provide basic plumbing for this e.g.
    ● Apache Jena Elephas
    ● http://jena.apache.org/documentation/hadoop/index.html
    ● Unfortunately ETL tends to be very customer and data specific
  • 7. Challenge #2 - Input Format
    ● What data format should we be using?
    ● There are at least four widely used standard serialisations:
    ● NTriples/NQuads, Turtle/TriG, RDF/XML and JSON-LD
    ● Plus a variety of lesser-used formats e.g. TriX, RDF/JSON, HDT, RDF/Thrift, Sesame Binary RDF etc.
    ● Choice of format affects how you process it
    ● Parallel processing
    ● Error tolerance
    ● State tracking
  • 8. Pattern #2 - Use NTriples/NQuads
    ● Simple but effective
    ● Can be arbitrarily split into chunks
    ● E.g. pick some number of bytes, split into chunks, seek from chunk boundaries to find actual line boundaries, process line by line
    ● Extremely error tolerant
    ● Every line can be processed independently without needing any shared state
    ● Even this has challenges:
    ● Verbose format so large datasets require extremely large files
    ● Blank nodes can still be problematic
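The chunking idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Cray Graph Engine or Jena Elephas implementation: the function names `chunk_ranges` and `read_chunk` are hypothetical, and the sketch assumes a single local file rather than a distributed file system.

```python
import os

def chunk_ranges(path, chunk_size):
    """Yield (start, end) byte ranges whose edges fall on line boundaries.

    Hypothetical helper: chunk_size is a nominal size in bytes (>= 1).
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        start = 0
        while start < size:
            # Jump to the nominal boundary, then read to the end of the
            # line we landed in so that no line is split across chunks.
            f.seek(min(start + chunk_size, size))
            f.readline()
            end = f.tell()
            yield start, end
            start = end

def read_chunk(path, start, end):
    """Return the non-empty, non-comment lines inside one byte range."""
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(end - start)
    lines = []
    for raw in data.splitlines():
        line = raw.decode("utf-8").strip()
        if line and not line.startswith("#"):
            lines.append(line)
    return lines
```

Each range could be handed to a different worker, which calls `read_chunk` and parses its lines without any shared state. A line that fails to parse can be skipped without affecting the others.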
  • 9. Challenge #3 - Blank Node Identifiers
    ● Specifications say that a blank node identifier is file scoped
    ● I.e. _:foo in a.nt is a different node from _:foo in b.nt
    ● And _:foo is the same node throughout a.nt
    ● Need to consistently assign identifiers despite processing the data in chunks on different physical nodes
    ● Preferably without resorting to global state/synchronisation
    <urn:a> <urn:link> _:foo .
    _:foo <urn:link> <urn:b> .
    # Many 100,000s of lines later
    <urn:z> <urn:link> _:foo .
    _:foo <urn:value> "example" .
    _:bar <urn:value> "other" .
    a.nt b.nt
  • 10. Pattern #3 - Derived Blank Node Identifiers
    ● Derive identifiers from a combination of their local identifier and a scope identifier
    ● E.g. _:foo and a.nt
    ● Derivation method doesn't matter provided it is:
    ● Scope aware
    ● Deterministic
    ● Some possibilities:
    ● One-way hash e.g. MD5
    ● Mathematical transform
    ● Seeded random number generator (RNG)
    ● Apache Jena uses seeded RNG
    ● Scope awareness achieved by seeding the RNG based upon the filename
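As an illustration of the one-way hash option listed on this slide, the following Python sketch derives an identifier from the scope (here assumed to be the filename) and the local label. It is not how Apache Jena does it, since the slide says Jena uses a seeded RNG. The function name `derive_bnode_id` and the `_:b` prefix are assumptions made for the example.

```python
import hashlib

def derive_bnode_id(scope, label):
    """Map a file-scoped blank node label to a scope-aware identifier.

    Hypothetical helper: scope is assumed to be the source filename,
    label is the local identifier without the leading "_:".
    """
    # Length-prefix the scope so ("ab", "c") and ("a", "bc") hash differently.
    key = f"{len(scope)}:{scope}{label}".encode("utf-8")
    # MD5 is used only as a deterministic fingerprint, not for security.
    return "_:b" + hashlib.md5(key).hexdigest()

# The same label in the same file always maps to the same identifier,
# no matter which worker or chunk encounters it.
same_file = derive_bnode_id("a.nt", "foo") == derive_bnode_id("a.nt", "foo")

# The same label in a different file maps to a different identifier.
other_file = derive_bnode_id("a.nt", "foo") != derive_bnode_id("b.nt", "foo")
```

Because the result depends only on the filename and the label, workers processing different chunks of a.nt agree on the identifier for _:foo without any global state or synchronisation.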
  • 11. Challenge #4 - Graph Partitioning
    ● Open problem
    ● NP-hard
    ● Large graphs are never going to be processable on a single node
    ● Need to partition across multiple nodes
    ● Partitioning affects both storage and processing of a graph
    ● May need different schemes depending on desired processing
  • 12. Pattern #4 - Domain Specific/Avoid It!
    ● For specific workloads a domain-specific partitioning will be best
    ● Needs knowledge of data and workload
    ● E.g. Educating the Planet with Pearson
    ● If you can then avoid it!
    ● Take advantage of increasingly capable hardware
    ● Large memory sizes, non-volatile memory, RDMA, high speed interconnects, SSDs
  • 13. Challenge #5 - Benchmarking
    ● Many of the classic benchmarks were developed by academics
    ● E.g. LUBM, SP2B
    ● Often aren’t representative of actual customer problems
    ● Many data generators are single threaded
    ● Difficult to generate large-scale datasets
  • 14. Pattern #5 - Change Benchmarks
    ● Linked Data Benchmark Council (LDBC)
    ● Industry working group that develops standardised benchmarks
    ● Equivalent to the Transaction Processing Performance Council (TPC) in the relational database industry
    ● http://ldbcouncil.org
    ● Design your own
    ● https://github.com/rvesse/sparql-query-bm
    ● Improve an existing one
    ● https://github.com/rvesse/lubm-uba
    ● LUBM 8k (~1 billion triples) can be generated in under 7 minutes which is a 10x speedup
  • 15. Questions? rvesse@cray.com @RobVesse