From relational databases to linked data:
R for the semantic web
            Jose Quesada,
      Max Planck Institute, Berlin
Who this talk targets
• You have big data; you use a database

• You have an evolving schema definition,
  sometimes changing at runtime

• You are interested in alternative ways to present
  your data

• You would thrive by using data out there, if only
  they were more accessible
Semantic web
Credit: Jim Hendler

THE TWO TOWERS
The Semantic web
        • Ontology as Barad-dur
          (Sauron’s tower)
          – Extremely powerful (decidable logic basis)
          – Patrolled by Orcs (inconsistency)
             • Let one little hobbit in it,
               and the whole thing could
               come crashing down
          – OWL
Inconsistency
The semantic web
        • The tower of Babel
           – We will build a tower to
             reach the sky
           – We only need a little
             ontological agreement
              • Who cares if we all speak
                different languages?
        • This is RDFS
        • Statistics matter here
        • Web-scale
        • Lots of data; finding
          anything in the mess can
          be a win
Approaches to data representation

•   Objects
•   Tables (relational databases)
•   Non-relational databases
•   Tables (data.frame)
•   Graphs
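What one can do with semantic web data,
now:
 People who died in Nazi Germany and, where available, any
 notable works they created
# Prefixes are predefined on the public DBpedia endpoint; declared here so the query is self-contained
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT *
WHERE {
  ?subject dbpprop:deathPlace <http://dbpedia.org/resource/Nazi_Germany> .
  OPTIONAL {
    ?subject dbpedia-owl:notableworks ?works
  }
}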
subject                                             works
:Anne_Frank                                         :The_Diary_of_a_Young_Girl
:Martin_Bormann                                     -
:Ir%C3%A8ne_N%C3%A9mirovsky                         -
:Erich_Fellgiebel                                   -
:Friedrich_Ferdinand%2C_Duke_of_Schleswig-Holstein  -
:Friedrich_Olbricht                                 -
:Ludwig_Beck                                        -
:Erwin_Rommel                                       -
:Maurice_Bavaud                                     -
:Early_Years_of_Adolf_Hitler                        -
:Emil_Zegad%C5%82owicz                              -
:Friedrich_Fromm                                    -
:Helmuth_James_Graf_von_Moltke
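
For readers who think in tables, the OPTIONAL clause in the query behaves much like a left outer join. The base R sketch below illustrates this on a toy triple table. The data frame, its column names and its three rows are made up for illustration and are not DBpedia output.

```r
# Toy triple store as a data.frame (illustrative rows, not real query results).
triples <- data.frame(
  subject   = c(":Anne_Frank", ":Anne_Frank", ":Erwin_Rommel"),
  predicate = c("deathPlace", "notableworks", "deathPlace"),
  object    = c(":Nazi_Germany", ":The_Diary_of_a_Young_Girl", ":Nazi_Germany"),
  stringsAsFactors = FALSE
)

# Required pattern: ?subject deathPlace :Nazi_Germany
died <- subset(triples,
               predicate == "deathPlace" & object == ":Nazi_Germany",
               select = "subject")

# Optional pattern: ?subject notableworks ?works
works <- subset(triples, predicate == "notableworks",
                select = c("subject", "object"))
names(works)[2] <- "works"

# OPTIONAL acts like a left outer join: subjects without a match keep NA.
result <- merge(died, works, by = "subject", all.x = TRUE)
print(result)
```

Subjects with no notable work appear with NA, which plays the role of the "-" entries in the result table.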
• Scale to the entire web
• Do reasoning with open world assumption
• Retrieval in real-time
• Go beyond logics

• Use cases:
   – Real time city
   – Cancer monographs for WHO
   – Gene expression finding
RDF is a graph
• We have lots of interesting statistics that run on graphs

• In many Semantic Web (SW) domains a tremendous
  amount of statements (expressed as triples) might be
  true but, in a given domain, only a small number of
  statements is known to be true or can be inferred to be
  true. It thus makes sense to attempt to estimate the
  truth values of statements by exploring regularities in
  the SW data with machine learning
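
A minimal sketch of what "RDF is a graph" means in practice: each triple is a directed edge from subject to object, so a set of triples can be turned into an adjacency matrix and summarised with ordinary graph statistics. The triples and the column names s, p, o below are hypothetical; only base R is used.

```r
# Hypothetical triples (subject, predicate, object), for illustration only.
triples <- data.frame(
  s = c("a", "a", "b", "c"),
  p = c("knows", "likes", "knows", "knows"),
  o = c("b", "c", "c", "a"),
  stringsAsFactors = FALSE
)

# Every distinct subject or object becomes a vertex.
nodes <- sort(unique(c(triples$s, triples$o)))

# Directed adjacency matrix: A[i, j] counts triples from node i to node j.
A <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
for (k in seq_len(nrow(triples))) {
  A[triples$s[k], triples$o[k]] <- A[triples$s[k], triples$o[k]] + 1
}

out_degree <- rowSums(A)  # statements in which the node is the subject
in_degree  <- colSums(A)  # statements in which the node is the object
print(rbind(out_degree, in_degree))
```

The predicate is dropped here for simplicity; keeping one matrix per predicate would be one way to retain it.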
Scale
• You cannot use the entire thing at once:
  subsetting

• Are there patterns in knowledge structures
  that we can use for subsetting?
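
One simple form of subsetting, sketched here under assumptions, is to keep only the nodes within k hops of a seed node. The edge list and the helper name neighbourhood are hypothetical, and the talk may have used a different criterion.

```r
# Hypothetical edge list; in practice it would come from a triple store.
edges <- data.frame(
  from = c("a", "b", "c", "d", "e"),
  to   = c("b", "c", "d", "e", "f"),
  stringsAsFactors = FALSE
)

# Return all nodes within k hops of seed, ignoring edge direction.
neighbourhood <- function(edges, seed, k) {
  visited  <- seed
  frontier <- seed
  for (step in seq_len(k)) {
    nxt <- c(edges$to[edges$from %in% frontier],
             edges$from[edges$to %in% frontier])
    frontier <- setdiff(unique(nxt), visited)
    if (length(frontier) == 0) break
    visited <- c(visited, frontier)
  }
  visited
}

# Keep only the edges whose two endpoints fall inside the neighbourhood.
keep <- neighbourhood(edges, seed = "c", k = 1)
subgraph <- edges[edges$from %in% keep & edges$to %in% keep, ]
print(subgraph)
```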
Idea
• Graph theory applied to subsetting large graphs

• Developing Semantic Web applications requires
  handling the RDF data model in a programming
  language

• Problem: current software is developed in the
  object-oriented paradigm, whereas programming in RDF is
  currently triple-based.
Data
   IMDB is a big graph:
   – 1.4 M movies
   – 1.7 M actors
   – 11 M connections
        • Movies have votes
   – Bipartite network

Packages: igraph
   –   Nice functions that you cannot find anywhere else
   –   Uses sparse matrices
   –   Implemented in C
   –   Some support for bipartite networks

RMySQL, Matrix (sparse matrices)
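
To make "bipartite network" concrete, here is a small base R sketch with made-up actor and movie labels. It builds the actor-by-movie incidence matrix and its one-mode projections. At IMDB scale a dense matrix would not fit in memory, so a sparse representation such as the Matrix package named above would be needed instead.

```r
# Hypothetical actor-movie pairs standing in for the IMDB tables.
cast <- data.frame(
  actor = c("A1", "A1", "A2", "A2", "A3"),
  movie = c("M1", "M2", "M1", "M3", "M3"),
  stringsAsFactors = FALSE
)

# Incidence matrix of the bipartite network: rows are actors, columns are movies.
B <- unclass(table(cast$actor, cast$movie))

# One-mode projections: shared movies per actor pair, shared actors per movie pair.
actor_actor <- B %*% t(B)
movie_movie <- t(B) %*% B

# Degree of each movie in the actor->movie network (its cast size).
movie_degree <- colSums(B)
print(movie_degree)
```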
Centrality
Pagerank
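  • The pagerank vector is the stationary
    distribution of a Markov chain on a
    link matrix

  • Some assumptions are needed to
    warrant convergence

  • The typical value of the damping factor d
    is .85

  (Figure: example graph with four nodes, labelled 1 to 4)
norm <- function(x) x/sum(x)  # rescale so the entries sum to 1
# Assumes A is the row-normalised link matrix and nVertices its number of nodes; d = 0.85.
# Re() drops the zero imaginary part that eigen() can attach for non-symmetric matrices.
norm(Re(eigen(0.15/nVertices + 0.85 * t(A))$vectors[,1]))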
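
As a sanity check on the one-liner above, the same vector can be obtained by power iteration, which is how PageRank is usually computed on graphs too large for a full eigen-decomposition. This is a sketch on a made-up four-node graph with no dangling nodes. It is not the network from the talk.

```r
# Hypothetical 4-node link structure; A[i, j] = 1 means node i links to node j.
A <- matrix(c(0, 1, 1, 0,
              0, 0, 1, 0,
              1, 0, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)
A <- A / rowSums(A)          # row-normalise: each row is a probability distribution
nVertices <- nrow(A)
d <- 0.85                    # damping factor

norm <- function(x) x / sum(x)

# Power iteration: apply the damped transition until the vector stops changing.
pr <- rep(1 / nVertices, nVertices)
for (i in 1:200) {
  new_pr <- (1 - d) / nVertices + d * as.vector(t(A) %*% pr)
  if (sum(abs(new_pr - pr)) < 1e-12) break
  pr <- new_pr
}

# Same vector from the eigen-decomposition formula.
pr_eigen <- norm(Re(eigen((1 - d) / nVertices + d * t(A))$vectors[, 1]))
print(round(cbind(pr, pr_eigen), 4))
```

Both columns should agree to within numerical tolerance. Node 3, which receives links from every other node, gets the highest score.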
Top movies by pageRank in the actor->movie network

degree  pagerank               cluster  imdbID   title                                   rank   votes
1298    0.000243688252192870   0        822609   Around the World in Eighty Days (1956)  40031  6134
313     0.000103540862390464   0        76352    "Beyond Our Control" (1968)             0      0
291     0.0000916690099912811  0        993780   Gone to Earth (1950)                    7.0    291
285     0.0000890255923652847  0        915626   Deadlands 2: Trapped (2008)             39971  15
424     0.000083882328163772   0        1282574  Stuck on You (2003)                     6.0    19709
629     0.0000808241101098043  0        622100   "Shortland Street" (1992)               39850  225
Problems
• Graphs have advantages over
  RDBMS/tables[1]. But we are used to thinking in
  tables
• There is no direct way to handle RDF in R.
  Worth an R package?
Linked data are out there, up for grabs

We need to start thinking in terms of graphs,
and slowly move away from tables




    Thanks for your attention
    Jose Quesada, quesada@workingcogs.com, http://josequesada.name
    Twitter: @Quesada

R for the semantic web, Quesada useR 2009
