SlideShare une entreprise Scribd logo
1  sur  32
Télécharger pour lire hors ligne
Introduction         Existing work         Own work    Short demo       Future work   Questions




                                     Large Knowledge Bases
                                       Storage and Distribution


                                     PAIVA LIMA DA SILVA Bruno

                        Universit´ Montpellier 2 (EPI GraphIK - INRIA - CNRS)
                                 e


                                           February 28, 2011




Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                          1 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




       1       Introduction

       2       Existing work

       3       Own work

       4       Short demo

       5       Future work

       6       Questions



Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                     2 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




Table of Contents


       1       Introduction

       2       Existing work

       3       Own work

       4       Short demo

       5       Future work

       6       Questions



Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                     3 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




Problem description


       Beginning of the PhD in October 2010 within the GraphIK team.
       Problem:
               Conjunctive query answering over large knowledge bases.
               Emergence of many, and larger and larger knowledge bases all
               around. (Social networks, ontologies, Linked Data, etc.)
       New paradigm:
               How does the fact that the size of KBs is increasing very fast
               affect our problem?




Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                     4 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




Possible approaches


       Three different approaches to address the deduction problem when
       dealing with large KBs:
               Algorithmic approach: Filtering, reducing exploration path,
               etc. (Team and other teams previous work)
               Approximative approach: “Pseudo-deduction”, probabilistic
               approach. (See Nicolas and others work)
               Storage methods: To check (and to find) one or more (the
               best) storage method that would enable us to perform
               conjunctive query answering with good efficiency. (Context of
               this thesis)



Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                     5 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




Well-known storage methods


       Well known storage methods:


       Graphs in main memory storage.
               Known by having solid efficiency for most basic operations
               needed for reasoning.
       Problem:
               Large does not fit in main memory.




Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                     6 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




Relational databases

               Known for supporting very well large knowledge bases, stored
               in main or secondary memory.
               Deduction may be made by using a SQL interface. (also
               translatable in BackTrack algorithms)
       Problems:
               SQL: Bad with semi-structured data, joining tables become
               highly “expensive” very quickly.
               Backtrack: Uses SQL for every basic operation, loss of
               efficiency in almost every basic operation call.
       RDBs May work very well in some cases (structured data), but
       may also very poorly in other situations.

Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                     7 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




Objectives

       Thesis goals:
               To identify different storage systems adapted to our problem.
               To have tools that enable us to compare several storage
               system between themselves.
               To answer: Which storage works better than the other? For
               which kind of data?
       Storage paradigms:
               To compare them when testing real data vs. unbiased
               generated data.
               To find ways to improve current algorithms for these systems
               based on our results.

Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                     8 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




Table of Contents


       1       Introduction

       2       Existing work

       3       Own work

       4       Short demo

       5       Future work

       6       Questions



Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                     9 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




Existing Work



               Description of the NoSQL “movement” and some of its
               features.
               Quick overview of some NoSQL popular projects and their
               particular characteristics.
               Currently available Graph databases.
               Description of Blueprints: a tool that uses a software-graph
               vision for the unification of differently implemented graph
               databases.




Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                    10 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




NoSQL: Short history


               1998: Carlo Strozzi started the ”NoSQL” project, a relational
               database without SQL interface.
               The project failed. He said that the problem was not on SQL,
               but on relational model.
               ”NoRel” ?
               NoSQL became then “Not Only SQL”.
               1999: First NoSQL Conference, called no:sql(east)
               Worst tagline ever: “select fun, profit from real world where
               relational=false”



Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                    11 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




NoSQL: What is it?



               ”Not Only SQL” may also be easily seen as “Non-relational”.
               That said, the NoSQL community around the world may also
               be easily seen as the union of different databases management
               systems that differ from the classic relational database model
               from Codd.
               Some of those systems may be put together into groups:
               Key-Value store, Object databases, XML databases, Graph
               databases, Document store. (Document as in JSON)




Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                    12 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




CAP Theorem


       History:
               2000: Conjecture made by Eric Brewer.
               2002: Formally proved by MIT scientists.
       Theorem:
       “A distributed storage system can satisfy any two of the following
       guarantees at the same time, but not all three:”
               Consistency.
               Availability.
               Partition tolerance.



Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                    13 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




CAP Theorem




Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                    14 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




NoSQL: Popular NoSQL projects

       Some NoSQL have become popular for being supported by big IT
       companies:
               Google BigTable (HBase implementation for Hadoop)
               Storage: Distributed multidimensional maps, Querying: GQL.
               Project Voldemort (Linkedin)
               Storage: Distributed Key-Value, Querying: Hashtable
               semantics.
               Apache Cassandra (Facebook,Digg)
               Storage: Distributed Key-Value, Querying: SQL-like.
               MongoDB (foursquare)
               Storage: BSON (Binary JSON) Documents, Querying:
               SQL-like.

Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                    15 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




NoSQL: Current graph DBs

       There are today several different implementations of Graph
       databases available for commercial and personal use:
               Neo4J,
               InfiniteGraph,
               sones,
               DEX,
               OrientDB,
               HyperGraphDB,
               and others...
       Some can however be regrouped since they do implement a
       “common” graph model:

Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                    16 / 32
Introduction         Existing work         Own work               Short demo     Future work    Questions




Property Graph Model

       [Source: UIM-GDB “Connections to the real world” presentation]
       The Property Graph model:


                                                      works-for
                                       {since : 23th Feb; contract : cdd}
        {id : 1; name : bob}                                                   {id : 2; name : tom}


       Characteristics of the model:
               directed: Each edge has a source and destination vertex.
               attributed: Vertices and edges carry key-value pairs.
               edge-labeled: The label denotes the type of the relationship.
               multigraph: Multiple edges between any two vertices allowed.

Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                                   17 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




Blueprints project




               Blueprints is a project from TinkerPop team. (±7 developers)
               Collections of interfaces and implementations for the Property
               Graph Model.
               Brings interoperability between Graph DBs by allowing
               developers to plug-and-play several Graph DB backends.
               Also unifies the indexing and transactions methods for those
               backends.

Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                    18 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




TinkerPop Team
       Other projects from the team that has created Blueprints:
               Gremlin: A graph traversal language based on Blueprints
               implementations.
               Rexster: Web service providing basic REST (GET, POST...)
               methods for stored Blueprints graphs.
               Graphdb-Bench: A Graph DB benchmarking tool that allows
               one to define and run benchmarks against different Graph DB
               implementations.




Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                    19 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




Table of Contents


       1       Introduction

       2       Existing work

       3       Own work

       4       Short demo

       5       Future work

       6       Questions



Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                    20 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




Details of the work

       Detailed below, the first steps of my work on the subject:
               To find all the potential storage methods and implementations
               that could be interesting to test against the classic methods.
               To design an architecture with a logic based vision that
               regroups different storage systems and enable us to perform
               operations over each one of them using an unique language.
               To write code clean enough to enable our architecture to scale
               up horizontally and vertically without having to restart all over
               again.
               To have an abstract platform solid enough that could regroup
               previous and future works of GraphIK team within an only
               software architecture.

Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                    21 / 32
Introduction         Existing work         Own work   Short demo   Future work    Questions




Problems ahead



               Not every NoSQL, and principally Graph DB implementation
               available is free or open source. (not easily testable then)
               Some Graph DBs may be easily plugged because of the
               Blueprints project, others, less easily. There is not actually a
               global notion of interoperability between those DBs yet.
               Reapparition of an old dilemma in software engineering: More
               general code vs. Specific and efficient code.




Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                     22 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




API Architecture




                      Figure: Current class diagram of our architecture.




Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                    23 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




Application description



               Implementation of an IFact interface that ensure that all the
               storage systems plugged will answer to the same methods.
               Implementation of Atom, Term and Predicate classes that
               allows all structures to communicate using a common type of
               data.
               In the case of our work, we have decided to use, by default,
               the fact that if two terms do have the same label, then they
               represent the same entity.




Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                    24 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




Technical details

               Due to the facility of plugging several different pieces of code
               together, we have decided to write our framework in JAVA.
               Pros: Almost everything can be easily plugged in.
               Cons: Loss in speed and efficiency.
               For now, our structure can handle a Relational DB (Sqlite)
               and adjacency lists stored graph. (in memory)
               We have then written a simple command-line interface in
               order to interact with the framework. (See demo next)
               The command-line interface instantiates an (almost) empty
               KB and has several commands for storing and querying our
               pieces of information.

Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                    25 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




Table of Contents


       1       Introduction

       2       Existing work

       3       Own work

       4       Short demo

       5       Future work

       6       Questions



Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                    26 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




Screenshot of the application




                         Figure: Application running on Ubuntu Linux.

Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                    27 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




Table of Contents


       1       Introduction

       2       Existing work

       3       Own work

       4       Short demo

       5       Future work

       6       Questions



Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                    28 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




Near future

       We can list some things that we are looking to do in the near
       future:
               To run all the tests needed to confirm or to discard our initial
               hypothesis.
               To add to our framework the possibility of managing RDF and
               3-Store data and to compare them aswell.
               To start testing our framework structure with larger and
               larger datasets.
               To start identifying which storage method is more
               appropriated than other according to the kind or structure of
               the data we are dealing with.


Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                    29 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




Then after...


       And also some that we will surely look in a not-so-near future:
               To implement a benchmarking tool that would help us to
               compare properly the storage systems we have when dealing
               with real large datasets against unbiased generated ones.
               To identify more operations/use cases for which we can
               compare our data structures.
               To develop/build a Graphic User Interface on top of our
               framework.




Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                    30 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




Table of Contents


       1       Introduction

       2       Existing work

       3       Own work

       4       Short demo

       5       Future work

       6       Questions



Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                    31 / 32
Introduction         Existing work         Own work   Short demo   Future work   Questions




                          (More) Questions & (more) comments...




Large Knowledge Bases
PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2)                                    32 / 32

Contenu connexe

Similaire à Large Knowledge Bases

DeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François GarillotDeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François Garillotsparktc
 
DeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François GarillotDeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François GarillotSteve Moore
 
DRESD Project Presentation - December 2006
DRESD Project Presentation - December 2006DRESD Project Presentation - December 2006
DRESD Project Presentation - December 2006santa
 
discopen
discopendiscopen
discopenJisc
 
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and PandasDistributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and PandasDatabricks
 
Open experiments and open-source
Open experiments and open-sourceOpen experiments and open-source
Open experiments and open-sourcepeircej
 
CNI fall 2009 enhanced publications john_doove-SURFfoundation
CNI fall 2009 enhanced publications john_doove-SURFfoundationCNI fall 2009 enhanced publications john_doove-SURFfoundation
CNI fall 2009 enhanced publications john_doove-SURFfoundationJohn Doove
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridEvert Lammerts
 
(Big) Data (Science) Skills
(Big) Data (Science) Skills(Big) Data (Science) Skills
(Big) Data (Science) SkillsOscar Corcho
 
AMP Camp 5 Intro
AMP Camp 5 IntroAMP Camp 5 Intro
AMP Camp 5 Introjeykottalam
 
MONAI and Open Science for Medical Imaging Deep Learning: SIPAIM 2020
MONAI and Open Science for Medical Imaging Deep Learning: SIPAIM 2020MONAI and Open Science for Medical Imaging Deep Learning: SIPAIM 2020
MONAI and Open Science for Medical Imaging Deep Learning: SIPAIM 2020Stephen Aylward
 
HPAI Class 2 - human aspects and computing systems in ai - 012920
HPAI  Class 2 - human aspects and computing systems in ai - 012920HPAI  Class 2 - human aspects and computing systems in ai - 012920
HPAI Class 2 - human aspects and computing systems in ai - 012920melendez321
 
Introduction to Distributed Computing Engines for Data Processing - Simone Ro...
Introduction to Distributed Computing Engines for Data Processing - Simone Ro...Introduction to Distributed Computing Engines for Data Processing - Simone Ro...
Introduction to Distributed Computing Engines for Data Processing - Simone Ro...Data Science Milan
 
The Unbearable Lightness of Wiking
The Unbearable Lightness of Wiking The Unbearable Lightness of Wiking
The Unbearable Lightness of Wiking Jie Bao
 
STM Innovations Seminar London
STM Innovations Seminar LondonSTM Innovations Seminar London
STM Innovations Seminar LondonPhilip Bourne
 
Deep learning and the systemic challenges of data science initiatives
Deep learning and the systemic challenges of data science initiativesDeep learning and the systemic challenges of data science initiatives
Deep learning and the systemic challenges of data science initiativesBalázs Kégl
 
Dame ivoa interop_brescia_naples2011
Dame ivoa interop_brescia_naples2011Dame ivoa interop_brescia_naples2011
Dame ivoa interop_brescia_naples2011INAF-OAC
 
Ml pluss ejan2013
Ml pluss ejan2013Ml pluss ejan2013
Ml pluss ejan2013CS, NcState
 
DL4J at Workday Meetup
DL4J at Workday MeetupDL4J at Workday Meetup
DL4J at Workday MeetupDavid Kale
 

Similaire à Large Knowledge Bases (20)

DeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François GarillotDeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François Garillot
 
DeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François GarillotDeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François Garillot
 
DRESD Project Presentation - December 2006
DRESD Project Presentation - December 2006DRESD Project Presentation - December 2006
DRESD Project Presentation - December 2006
 
discopen
discopendiscopen
discopen
 
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and PandasDistributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas
 
Open experiments and open-source
Open experiments and open-sourceOpen experiments and open-source
Open experiments and open-source
 
CNI fall 2009 enhanced publications john_doove-SURFfoundation
CNI fall 2009 enhanced publications john_doove-SURFfoundationCNI fall 2009 enhanced publications john_doove-SURFfoundation
CNI fall 2009 enhanced publications john_doove-SURFfoundation
 
Hadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG GridHadoop @ Sara & BiG Grid
Hadoop @ Sara & BiG Grid
 
(Big) Data (Science) Skills
(Big) Data (Science) Skills(Big) Data (Science) Skills
(Big) Data (Science) Skills
 
AMP Camp 5 Intro
AMP Camp 5 IntroAMP Camp 5 Intro
AMP Camp 5 Intro
 
MONAI and Open Science for Medical Imaging Deep Learning: SIPAIM 2020
MONAI and Open Science for Medical Imaging Deep Learning: SIPAIM 2020MONAI and Open Science for Medical Imaging Deep Learning: SIPAIM 2020
MONAI and Open Science for Medical Imaging Deep Learning: SIPAIM 2020
 
HPAI Class 2 - human aspects and computing systems in ai - 012920
HPAI  Class 2 - human aspects and computing systems in ai - 012920HPAI  Class 2 - human aspects and computing systems in ai - 012920
HPAI Class 2 - human aspects and computing systems in ai - 012920
 
Introduction to Distributed Computing Engines for Data Processing - Simone Ro...
Introduction to Distributed Computing Engines for Data Processing - Simone Ro...Introduction to Distributed Computing Engines for Data Processing - Simone Ro...
Introduction to Distributed Computing Engines for Data Processing - Simone Ro...
 
The Unbearable Lightness of Wiking
The Unbearable Lightness of Wiking The Unbearable Lightness of Wiking
The Unbearable Lightness of Wiking
 
STM Innovations Seminar London
STM Innovations Seminar LondonSTM Innovations Seminar London
STM Innovations Seminar London
 
Deep learning and the systemic challenges of data science initiatives
Deep learning and the systemic challenges of data science initiativesDeep learning and the systemic challenges of data science initiatives
Deep learning and the systemic challenges of data science initiatives
 
Dame ivoa interop_brescia_naples2011
Dame ivoa interop_brescia_naples2011Dame ivoa interop_brescia_naples2011
Dame ivoa interop_brescia_naples2011
 
Ml pluss ejan2013
Ml pluss ejan2013Ml pluss ejan2013
Ml pluss ejan2013
 
Solid principes
Solid principesSolid principes
Solid principes
 
DL4J at Workday Meetup
DL4J at Workday MeetupDL4J at Workday Meetup
DL4J at Workday Meetup
 

Dernier

Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024Janet Corral
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...PsychoTech Services
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 

Dernier (20)

INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 

Large Knowledge Bases

  • 1. Introduction Existing work Own work Short demo Future work Questions Large Knowledge Bases Storage and Distribution PAIVA LIMA DA SILVA Bruno Universit´ Montpellier 2 (EPI GraphIK - INRIA - CNRS) e February 28, 2011 Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 1 / 32
  • 2. Introduction Existing work Own work Short demo Future work Questions 1 Introduction 2 Existing work 3 Own work 4 Short demo 5 Future work 6 Questions Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 2 / 32
  • 3. Introduction Existing work Own work Short demo Future work Questions Table of Contents 1 Introduction 2 Existing work 3 Own work 4 Short demo 5 Future work 6 Questions Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 3 / 32
  • 4. Introduction Existing work Own work Short demo Future work Questions Problem description Beginning of the PhD in October 2010 within the GraphIK team. Problem: Conjunctive query answering over large knowledge bases. Emergence of many, and larger and larger knowledge bases all around. (Social networks, ontologies, Linked Data, etc.) New paradigm: How does the fact that the size of KBs is increasing very fast affect our problem? Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 4 / 32
  • 5. Introduction Existing work Own work Short demo Future work Questions Possible approaches Three different approaches to address the deduction problem when dealing with large KBs: Algorithmic approach: Filtering, reducing exploration path, etc. (Team and other teams previous work) Approximative approach: “Pseudo-deduction”, probabilistic approach. (See Nicolas and others work) Storage methods: To check (and to find) one or more (the best) storage method that would enable us to perform conjunctive query answering with good efficiency. (Context of this thesis) Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 5 / 32
  • 6. Introduction Existing work Own work Short demo Future work Questions Well-known storage methods Well known storage methods: Graphs in main memory storage. Known by having solid efficiency for most basic operations needed for reasoning. Problem: Large does not fit in main memory. Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 6 / 32
  • 7. Introduction Existing work Own work Short demo Future work Questions Relational databases Known for supporting very well large knowledge bases, stored in main or secondary memory. Deduction may be made by using a SQL interface. (also translatable in BackTrack algorithms) Problems: SQL: Bad with semi-structured data, joining tables become highly “expensive” very quickly. Backtrack: Uses SQL for every basic operation, loss of efficiency in almost every basic operation call. RDBs May work very well in some cases (structured data), but may also very poorly in other situations. Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 7 / 32
  • 8. Introduction Existing work Own work Short demo Future work Questions Objectives Thesis goals: To identify different storage systems adapted to our problem. To have tools that enable us to compare several storage system between themselves. To answer: Which storage works better than the other? For which kind of data? Storage paradigms: To compare them when testing real data vs. unbiased generated data. To find ways to improve current algorithms for these systems based on our results. Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 8 / 32
  • 9. Introduction Existing work Own work Short demo Future work Questions Table of Contents 1 Introduction 2 Existing work 3 Own work 4 Short demo 5 Future work 6 Questions Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 9 / 32
  • 10. Introduction Existing work Own work Short demo Future work Questions Existing Work Description of the NoSQL “movement” and some of its features. Quick overview of some NoSQL popular projects and their particular characteristics. Currently available Graph databases. Description of Blueprints: a tool that uses a software-graph vision for the unification of differently implemented graph databases. Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 10 / 32
  • 11. Introduction Existing work Own work Short demo Future work Questions NoSQL: Short history 1998: Carlo Strozzi started the ”NoSQL” project, a relational database without SQL interface. The project failed. He said that the problem was not on SQL, but on relational model. ”NoRel” ? NoSQL became then “Not Only SQL”. 1999: First NoSQL Conference, called no:sql(east) Worst tagline ever: “select fun, profit from real world where relational=false” Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 11 / 32
  • 12. Introduction Existing work Own work Short demo Future work Questions NoSQL: What is it? ”Not Only SQL” may also be easily seen as “Non-relational”. That said, the NoSQL community around the world may also be easily seen as the union of different databases management systems that differ from the classic relational database model from Codd. Some of those systems may be put together into groups: Key-Value store, Object databases, XML databases, Graph databases, Document store. (Document as in JSON) Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 12 / 32
  • 13. Introduction Existing work Own work Short demo Future work Questions CAP Theorem History: 2000: Conjecture made by Eric Brewer. 2002: Formally proved by MIT scientists. Theorem: “A distributed storage system can satisfy any two of the following guarantees at the same time, but not all three:” Consistency. Availability. Partition tolerance. Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 13 / 32
  • 14. Introduction Existing work Own work Short demo Future work Questions CAP Theorem Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 14 / 32
  • 15. Introduction Existing work Own work Short demo Future work Questions NoSQL: Popular NoSQL projects Some NoSQL have become popular for being supported by big IT companies: Google BigTable (HBase implementation for Hadoop) Storage: Distributed multidimensional maps, Querying: GQL. Project Voldemort (Linkedin) Storage: Distributed Key-Value, Querying: Hashtable semantics. Apache Cassandra (Facebook,Digg) Storage: Distributed Key-Value, Querying: SQL-like. MongoDB (foursquare) Storage: BSON (Binary JSON) Documents, Querying: SQL-like. Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 15 / 32
  • 16. Introduction Existing work Own work Short demo Future work Questions NoSQL: Current graph DBs There are today several different implementations of Graph databases available for commercial and personal use: Neo4J, InfiniteGraph, sones, DEX, OrientDB, HyperGraphDB, and others... Some can however be regrouped since they do implement a “common” graph model: Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 16 / 32
  • 17. Introduction Existing work Own work Short demo Future work Questions Property Graph Model [Source: UIM-GDB “Connections to the real world” presentation] The Property Graph model: works-for {since : 23th Feb; contract : cdd} {id : 1; name : bob} {id : 2; name : tom} Characteristics of the model: directed: Each edge has a source and destination vertex. attributed: Vertices and edges carry key-value pairs. edge-labeled: The label denotes the type of the relationship. multigraph: Multiple edges between any two vertices allowed. Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 17 / 32
  • 18. Introduction Existing work Own work Short demo Future work Questions Blueprints project Blueprints is a project from TinkerPop team. (±7 developers) Collections of interfaces and implementations for the Property Graph Model. Brings interoperability between Graph DBs by allowing developers to plug-and-play several Graph DB backends. Also unifies the indexing and transactions methods for those backends. Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 18 / 32
  • 19. Introduction Existing work Own work Short demo Future work Questions TinkerPop Team Other projects from the team that has created Blueprints: Gremlin: A graph traversal language based on Blueprints implementations. Rexster: Web service providing basic REST (GET, POST...) methods for stored Blueprints graphs. Graphdb-Bench: A Graph DB benchmarking tool that allows one to define and run benchmarks against different Graph DB implementations. Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 19 / 32
  • 20. Introduction Existing work Own work Short demo Future work Questions Table of Contents 1 Introduction 2 Existing work 3 Own work 4 Short demo 5 Future work 6 Questions Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 20 / 32
  • 21. Introduction Existing work Own work Short demo Future work Questions Details of the work Detailed below, the first steps of my work on the subject: To find all the potential storage methods and implementations that could be interesting to test against the classic methods. To design an architecture with a logic based vision that regroups different storage systems and enable us to perform operations over each one of them using an unique language. To write code clean enough to enable our architecture to scale up horizontally and vertically without having to restart all over again. To have an abstract platform solid enough that could regroup previous and future works of GraphIK team within an only software architecture. Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 21 / 32
  • 22. Introduction Existing work Own work Short demo Future work Questions Problems ahead Not every NoSQL, and principally Graph DB implementation available is free or open source. (not easily testable then) Some Graph DBs may be easily plugged because of the Blueprints project, others, less easily. There is not actually a global notion of interoperability between those DBs yet. Reapparition of an old dilemma in software engineering: More general code vs. Specific and efficient code. Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 22 / 32
  • 23. Introduction Existing work Own work Short demo Future work Questions API Architecture Figure: Current class diagram of our architecture. Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 23 / 32
  • 24. Introduction Existing work Own work Short demo Future work Questions Application description Implementation of an IFact interface that ensure that all the storage systems plugged will answer to the same methods. Implementation of Atom, Term and Predicate classes that allows all structures to communicate using a common type of data. In the case of our work, we have decided to use, by default, the fact that if two terms do have the same label, then they represent the same entity. Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 24 / 32
  • 25. Introduction Existing work Own work Short demo Future work Questions Technical details Due to the facility of plugging several different pieces of code together, we have decided to write our framework in JAVA. Pros: Almost everything can be easily plugged in. Cons: Loss in speed and efficiency. For now, our structure can handle a Relational DB (Sqlite) and adjacency lists stored graph. (in memory) We have then written a simple command-line interface in order to interact with the framework. (See demo next) The command-line interface instantiates an (almost) empty KB and has several commands for storing and querying our pieces of information. Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 25 / 32
  • 26. Introduction Existing work Own work Short demo Future work Questions Table of Contents 1 Introduction 2 Existing work 3 Own work 4 Short demo 5 Future work 6 Questions Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 26 / 32
  • 27. Introduction Existing work Own work Short demo Future work Questions Screenshot of the application Figure: Application running on Ubuntu Linux. Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 27 / 32
  • 28. Introduction Existing work Own work Short demo Future work Questions Table of Contents 1 Introduction 2 Existing work 3 Own work 4 Short demo 5 Future work 6 Questions Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 28 / 32
  • 29. Introduction Existing work Own work Short demo Future work Questions Near future We can list some things that we are looking to do in the near future: To run all the tests needed to confirm or to discard our initial hypothesis. To add to our framework the possibility of managing RDF and 3-Store data and to compare them aswell. To start testing our framework structure with larger and larger datasets. To start identifying which storage method is more appropriated than other according to the kind or structure of the data we are dealing with. Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 29 / 32
  • 30. Introduction Existing work Own work Short demo Future work Questions Then after... And also some that we will surely look in a not-so-near future: To implement a benchmarking tool that would help us to compare properly the storage systems we have when dealing with real large datasets against unbiased generated ones. To identify more operations/use cases for which we can compare our data structures. To develop/build a Graphic User Interface on top of our framework. Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 30 / 32
  • 31. Introduction Existing work Own work Short demo Future work Questions Table of Contents 1 Introduction 2 Existing work 3 Own work 4 Short demo 5 Future work 6 Questions Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 31 / 32
  • 32. Introduction Existing work Own work Short demo Future work Questions (More) Questions & (more) comments... Large Knowledge Bases PAIVA LIMA DA SILVA Bruno (Univ. Montpellier 2) 32 / 32