Bio4j: A pioneer graph based database for the integration of biological Big Data




www.ohnosequences.com                   www.bio4j.com
What’s Bio4j?
     Bio4j is a bioinformatics graph-based DB that includes most of the data
     available in:
            UniProt (Swiss-Prot + TrEMBL)

            Gene Ontology (GO)

            UniRef (50, 90, 100)

            NCBI Taxonomy

            RefSeq

            Enzyme DB





     It provides a completely new and powerful framework
     for querying and managing protein-related information.


     Since it relies on a high-performance graph engine, data
     is stored in a way that semantically represents its own
     structure.





     Bio4j uses Neo4j technology, a "high-performance graph
     engine with all the features of a mature and robust
     database".

     Because it is built on Neo4j and provides its own API,
     Bio4j is also very scalable: anyone can easily
     incorporate their own data and make the most of it.





     Everything in Bio4j is open source, released under AGPLv3.




Bioinformatics DBs and Graphs

     Highly interconnected, overlapping knowledge spread throughout different DBs

Initial motivation

Bio4j structure

Some samples

Why Bio4j?

Bio4j and the Cloud

Upcoming features

     However, in most cases all this data is modeled in relational databases,
     and sometimes just as plain CSV files.

     As the amount and diversity of data grow, domain models
     become crazily complicated!

     With a relational paradigm, the double implication Entity ⇔ Table
     does not go both ways.

          - You get ‘auxiliary’ tables that have no relationship with the small
            piece of reality you are modeling.

          - You need ‘artificial’ IDs only for connecting entities (and these are
            mixed with IDs that somehow live in reality).

          - Entity-relationship models are cool, but in the end you always have to
            deal with ‘raw’ tables plus SQL.

          - Integrating/incorporating new knowledge into already existing
            databases is hard, and sometimes not even possible without changing
            the domain model.

     Life in general and biology in particular are probably not 100% like a graph…

     but one thing’s sure, they are not a set of tables!

     NoSQL (not only SQL)

     NoSQ… what!??

     Let’s see what Wikipedia says…

          “NoSQL is a broad class of database management systems
          that differ from the classic model of the relational database
          management system (RDBMS) in some significant ways.
          These data stores may not require fixed table schemas,
          usually avoid join operations and typically scale
          horizontally.”

     NoSQL data models

     Cassandra is a highly scalable, eventually consistent,
     distributed, structured key-value store. Cassandra brings
     together the distributed systems technologies from Dynamo and the
     data model from Google's BigTable.

     MongoDB (from "humongous") is an open source document-oriented
     NoSQL database system written in the C++ programming language.

     Neo4j is a high-performance, NoSQL graph database with all
     the features of a mature and robust database.

     The programmer works with an object-oriented, flexible
     network structure rather than with strict and static tables.

     All the benefits of a fully transactional, enterprise-strength
     database.

     For many applications, Neo4j offers performance
     improvements on the order of 1000x or more compared to
     relational DBs.

Initial motivation

     OK, but why start all this?
     Were you that bored…?!

     It all started around our need for massive access to
     protein GO (Gene Ontology) annotations.

     At that point I had to develop my own MySQL DB based on the official
     GO SQL database, and problems started from the beginning:

          - I went crazy ‘deciphering’ how to extract UniProt protein annotations
            from the official GO table schema.

          - UniProt and GO official protein annotations were not always consistent.

          - Populating my own DB took really long because of all the joins and
            subqueries needed to get and store the protein annotations.

          - Soon enough we also needed massive access to basic
            protein information.

     These processes had to be automated for BG7, our bacterial genome
     annotation system (specifically designed for NGS data).

          The available UniProt web services were too limited:

            - Slow

            - Limited number of queries

            - Too little information available

          So I downloaded the whole UniProt DB in XML format
          (Swiss-Prot + TrEMBL)

          and started to have some fun with it!

     We got used to having massive direct access to all this
     protein-related information…

          So why not add other resources that we needed in most projects
          and that were becoming a sort of bottleneck compared to
          those already included in Bio4j?

     Then came:

           -   Isoform sequences

           -   Protein interactions and features

           -   UniRef 50, 90, and 100

           -   RefSeq

           -   NCBI Taxonomy

           -   Enzyme (ExPASy) DB

     Let’s dig a bit into the Bio4j structure:

     [Figure: data sources and their relationships]

     The Graph DB model: representation

     Core abstractions:
        - Nodes

        - Relationships between nodes

        - Properties on both

     Let’s dig a bit into the Bio4j structure:

          How are things modeled?

          Couldn’t be simpler!

          Entities                      ->  Nodes
          Associations / Relationships  ->  Edges

     Some examples of nodes would be:

          Protein
          GO term
          Genome Element

     and relationships:

          Protein -[PROTEIN_GO_ANNOTATION]-> GO term

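The mapping above (entities become nodes, associations become edges, and both can carry properties) can be sketched in a few lines of plain Java. This is only an illustration of the data model under stated assumptions: the class names `PropertyGraphSketch`, `Node`, `Relationship` and the `connect` helper are hypothetical and are not part of the Neo4j or Bio4j APIs, and the GO term ID and `evidence` property are illustrative values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a tiny in-memory property graph, not the Neo4j or Bio4j API.
final class PropertyGraphSketch {

    // A node carries an arbitrary set of key/value properties.
    static final class Node {
        final Map<String, Object> properties = new HashMap<>();
        final List<Relationship> outgoing = new ArrayList<>();
        final List<Relationship> incoming = new ArrayList<>();
    }

    // A relationship is typed and directed, and can carry properties too.
    static final class Relationship {
        final String type;
        final Node start;
        final Node end;
        final Map<String, Object> properties = new HashMap<>();

        Relationship(String type, Node start, Node end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Connects two nodes and registers the relationship on both ends.
    static Relationship connect(Node start, String type, Node end) {
        Relationship rel = new Relationship(type, start, end);
        start.outgoing.add(rel);
        end.incoming.add(rel);
        return rel;
    }

    public static void main(String[] args) {
        Node protein = new Node();
        protein.properties.put("accession", "P12345");

        Node goTerm = new Node();
        goTerm.properties.put("id", "GO:0005739"); // illustrative value

        Relationship annotation = connect(protein, "PROTEIN_GO_ANNOTATION", goTerm);
        annotation.properties.put("evidence", "illustrative"); // hypothetical property

        // Walk the protein's outgoing relationships, as a graph traversal would.
        for (Relationship rel : protein.outgoing) {
            System.out.println(rel.type + " -> " + rel.end.properties.get("id"));
        }
    }
}
```

No join table or artificial connecting ID is needed here: the relationship object itself links the two entities.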
     We have developed Bio4jExplorer, a tool intended to serve both as a reference
     manual and as a first point of contact with the Bio4j domain model.

     Bio4jExplorer allows you to:

     • Navigate through all nodes and relationships

     • Access the javadocs of any node or relationship

     • Graphically explore the neighborhood of a node/relationship

     • Look up the indexes that may serve as an entry point for a node

     • Check incoming/outgoing relationships of a specific node

     • Check start/end nodes of a specific relationship

     Entry points and indexing

     There are two kinds of entry points for the graph:

          Auxiliary relationships going from the reference node, e.g.

            - CELLULAR_COMPONENT: leads to the root of the GO cellular component
              sub-ontology

            - MAIN_DATASET: leads to both main datasets: Swiss-Prot and TrEMBL

          Node indexing

          There are two types of node indexes:

            - Exact: Only exact values are considered hits

            - Fulltext: Regular expressions can be used

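The difference between the two index types can be sketched as follows. This is a minimal illustration, assuming that "fulltext" matching behaves like a regular-expression search over the indexed values, as the slide states. The class `NodeIndexSketch` and its methods are hypothetical, the node labels are placeholders, and the sketch says nothing about how Neo4j implements its indexes internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Illustrative stand-in for a node index; not the Neo4j or Bio4j API.
final class NodeIndexSketch {
    // Indexed value -> node label. TreeMap keeps iteration order deterministic.
    private final Map<String, String> entries = new TreeMap<>();

    void add(String indexedValue, String node) {
        entries.put(indexedValue, node);
    }

    // Exact lookup: only a value equal to the query counts as a hit.
    String exact(String value) {
        return entries.get(value);
    }

    // Pattern lookup: every indexed value containing a match for the regex is a hit.
    List<String> matching(String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : entries.entrySet()) {
            if (pattern.matcher(entry.getKey()).find()) {
                hits.add(entry.getValue());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        NodeIndexSketch datasetNames = new NodeIndexSketch();
        datasetNames.add("Swiss-Prot", "dataset node: Swiss-Prot");
        datasetNames.add("Trembl", "dataset node: Trembl");

        System.out.println(datasetNames.exact("Swiss-Prot")); // exact hit
        System.out.println(datasetNames.exact("Swiss"));      // null: a partial value is not a hit
        System.out.println(datasetNames.matching("^Swiss"));  // pattern hit on the same entry
    }
}
```

An exact index suits identifiers such as accessions, while pattern matching suits free-text fields such as names, at the cost of a more expensive lookup.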
Some samples

     Retrieving protein info (Bio4jModel Java API)

     // (imports of java.util.List and the Bio4jModel classes omitted)
     //--creating manager and node retriever----
     Bio4jManager manager = new Bio4jManager("/mybio4jdb");
     NodeRetriever nR = new NodeRetriever(manager);

     ProteinNode protein = nR.getProteinNodeByAccession("P12345");

     // Getting more related info...
     List<InterproNode> interpros = protein.getInterpro();
     OrganismNode organism = protein.getOrganism();
     List<GoTermNode> goAnnotations = protein.getGOAnnotations();

     List<ArticleNode> articles = protein.getArticleCitations();
     for (ArticleNode article : articles) {
         System.out.println(article.getPubmedId());
     }

     // And don't forget to close the Bio4jManager
     manager.shutDown();

     Proteins with InterPro motif ‘IPR000847’ (Bio4jModel Java API)

     //--creating manager and node retriever----
     Bio4jManager manager = new Bio4jManager("/mybio4jdb");
     NodeRetriever nR = new NodeRetriever(manager);

     InterproNode interpro = nR.getInterproById("IPR000847");
     ProteinInterproRel rel = new ProteinInterproRel(null);

     // Neo4j returns an Iterable of relationships; take its iterator
     Iterator<Relationship> iterator =
            interpro.getNode().getRelationships(rel, Direction.INCOMING).iterator();

     // Each incoming PROTEIN_INTERPRO relationship starts at a protein node
     while (iterator.hasNext()) {
         ProteinNode p = new ProteinNode(iterator.next().getStartNode());
         System.out.println(p.getAccession());
     }

     // And don't forget to close the Bio4jManager
     manager.shutDown();

     Querying Bio4j with Cypher

     Getting a keyword by its ID

     START k=node:keyword_id_index(keyword_id_index = "KW-0181")
     return k.name, k.id

     Finding circuits/simple cycles of length 3 where at least one protein is from
     the Swiss-Prot dataset:

     START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
     MATCH d <-[r:PROTEIN_DATASET]- p,
           circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2)
                         -[:PROTEIN_PROTEIN_INTERACTION]-> (p3)
                         -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
     return p.accession, p2.accession, p3.accession

          Check this blog post for more info and our Bio4j Cypher cheat sheet

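What the circuit query computes can be sketched in plain Java over an in-memory adjacency map. This is a hedged illustration of the pattern, not a description of how Cypher evaluates it: the class `CircuitSketch`, its method, and the map-based graph representation are all hypothetical, and `startProteins` stands in for the proteins linked to the Swiss-Prot dataset node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: enumerates directed cycles p -> p2 -> p3 -> p.
final class CircuitSketch {

    // interactions: directed adjacency, protein accession -> accessions it points to.
    // startProteins: the proteins allowed in the first position of a circuit.
    static List<List<String>> circuitsOfLengthThree(
            Map<String, Set<String>> interactions, Set<String> startProteins) {
        List<List<String>> circuits = new ArrayList<>();
        for (String p : startProteins) {
            for (String p2 : interactions.getOrDefault(p, Set.of())) {
                if (p2.equals(p)) continue; // keep the cycle simple: no repeated node
                for (String p3 : interactions.getOrDefault(p2, Set.of())) {
                    if (p3.equals(p) || p3.equals(p2)) continue;
                    // Close the circuit: p3 must point back to the start protein.
                    if (interactions.getOrDefault(p3, Set.of()).contains(p)) {
                        circuits.add(List.of(p, p2, p3));
                    }
                }
            }
        }
        return circuits;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> interactions = Map.of(
                "A", Set.of("B"),
                "B", Set.of("C"),
                "C", Set.of("A"));
        System.out.println(circuitsOfLengthThree(interactions, Set.of("A")));
    }
}
```

As with the query, a circuit is reported once for each of its proteins that appears in the start set, so the same cycle can show up in several rotations.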
     Querying Bio4j with Gremlin

     Get protein by its accession number and return its full name

      gremlin> g.idx('protein_accession_index')[['protein_accession_index':'P12345']].full_name
      ==> Aspartate aminotransferase, mitochondrial

     Get proteins (accessions) associated with an InterPro motif (limited to 4 results)

      gremlin> g.idx('interpro_id_index')[['interpro_id_index':'IPR023306']].inE('PROTEIN_INTERPRO').outV.accession[0..3]
      ==> E2GK26
      ==> G3PMS4
      ==> G3Q865
      ==> G3PIL8

          Check our Bio4j Gremlin cheat sheet

     REST Server

     You can also query/navigate through Bio4j with the REST API!

     The default representation is JSON, both for responses and for data sent with
     POST/PUT requests.

     Get protein by its accession number (Q9UR66):

     http://server_url:7474/db/data/index/node/protein_accession_index/protein_accession_index/Q9UR66

     Get outgoing relationships for protein Q9UR66:

     http://server_url:7474/db/data/node/Q9UR66_node_id/relationships/out

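The two URL shapes above can be assembled programmatically. The sketch below only builds the URLs and sends no request. The class `Bio4jRestUrls`, its method names, and the `http://localhost:7474` base URL are assumptions made for illustration, and the numeric node ID is a placeholder for whatever ID the index lookup returns for the protein.

```java
import java.net.URI;

// Illustrative only: builds the REST URLs shown on the slide; sends no request.
final class Bio4jRestUrls {

    // Index lookup: .../index/node/{index name}/{key}/{value}.
    // On the slide, index name and key are both "protein_accession_index".
    static URI proteinByAccession(String serverUrl, String accession) {
        return URI.create(serverUrl
                + "/db/data/index/node/protein_accession_index/protein_accession_index/"
                + accession);
    }

    // Outgoing relationships of a node, addressed by its numeric node ID.
    static URI outgoingRelationships(String serverUrl, long nodeId) {
        return URI.create(serverUrl + "/db/data/node/" + nodeId + "/relationships/out");
    }

    public static void main(String[] args) {
        String serverUrl = "http://localhost:7474"; // assumed host and port
        System.out.println(proteinByAccession(serverUrl, "Q9UR66"));
        System.out.println(outgoingRelationships(serverUrl, 42L)); // 42 is a placeholder ID
    }
}
```

A client would typically issue a GET on the first URL, read the node ID from the JSON response, and then use it in the second URL.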
     Visualizations (1): REST Server Data Browser

     Navigate through Bio4j data in real time!

     Visualizations (2): Bio4j + Gephi

     Get really cool graph visualizations using Bio4j and Gephi, a visualization and
     exploration platform.

     Visualizations (3): Bio4j GO Tools

     Why would I use Bio4j?

     - Massive access to protein/genome/taxonomy… related information

     - Integration of your own DBs/resources around common information

     - Development of services tailored to your needs built around Bio4j

     - Network analysis

     - Visualizations

     - Besides many others I cannot think of myself…
       If you have something in mind for which Bio4j might be useful, please let
       us know so we can all see how it could help you meet your needs! ;)

     Bio4j + Cloud (1)

     We use AWS (Amazon Web Services) everywhere we can around
     Bio4j, which gives us the following benefits:

          Interoperability and data distribution

          Releases are available as public EBS Snapshots, so AWS users can
          create fully ready Bio4j DB volumes and attach them to their
          instances in just a few seconds.

          CloudFormation templates:

             - Basic Bio4j DB Instance

             - Bio4j REST Server Instance

     Bio4j + Cloud (2)

          Backup and Storage using S3 (Simple Storage Service)

          We use S3 both for backup (indirectly through the EBS snapshots) and
          storage (directly storing RefSeq sequences as independent S3 files).

          What kind of benefits do we get from this?

             • Easy to use

             • Flexible

             • Cost-effective

             • Reliable

             • Scalable and high-performance

             • Secure

     Bio4j + Cloud (3)

          Web servers and service providers in the cloud

          Deploying your own web server in AWS using Bio4j as back-end is really
          simple.

          A good example of this would be Bio4jTestServer, a continuously
          developed server showcasing Web Services based on Bio4j.

     Upcoming features

     - Relationship indexing for relationships going to and coming from supernodes

       No one’s perfect, and Bio4j is no exception.
       Relationship fetching can become a bottleneck whenever you have to deal
       with supernodes (unless you index these relationships). Fortunately, this is
       something that Neo4j is going to fix in the next version(s).

     - More resources available (Reactome…)

     - Improvements in the importing process

     - A more complete version of Bio4jModel

       Allowing users to perform almost all sorts of queries without having to worry
       about the Neo4j core API.

     - New tools, services and visualizations built around Bio4j

     Community

     Bio4j has a fast-growing internet presence:

      - Twitter: check @bio4j for updates

      - Blog: go to http://blog.bio4j.com

      - Mailing list: ask any question you may have on our list.

      - LinkedIn: check the Bio4j group

      - GitHub issues: don’t be shy! Open a new issue if you think
                       something’s going wrong.

     and... Who’s behind all this?

     Bio4j is being developed by members of the Oh no sequences! team and
     Era7 Bioinformatics:

      - Pablo Pareja Tobes: Main developer (that’s me!)

      - Eduardo Pareja Tobes: Technology and architecture main advisor

      - Raquel Tobes: Bioinformatics main advisor

      - Marina Manrique: Bioinformatics support

      - Eduardo Pareja: Scientific advisor

     That’s it!

     Thanks for your time ;)

BITS: Overview of important biological databases beyond sequencesBITS: Overview of important biological databases beyond sequences
BITS: Overview of important biological databases beyond sequences
 
BioPerl (Poster T02, ISMB 2010)
BioPerl (Poster T02, ISMB 2010)BioPerl (Poster T02, ISMB 2010)
BioPerl (Poster T02, ISMB 2010)
 
BioPerl (Poster T02, ISMB 2010)
BioPerl (Poster T02, ISMB 2010)BioPerl (Poster T02, ISMB 2010)
BioPerl (Poster T02, ISMB 2010)
 
Knetminer Backend Training, Nov 2018
Knetminer Backend Training, Nov 2018Knetminer Backend Training, Nov 2018
Knetminer Backend Training, Nov 2018
 
Sssc2011 ontologies final
Sssc2011 ontologies finalSssc2011 ontologies final
Sssc2011 ontologies final
 
NCBO Technology
NCBO TechnologyNCBO Technology
NCBO Technology
 
Cshl minseqe 2013_ouellette
Cshl minseqe 2013_ouelletteCshl minseqe 2013_ouellette
Cshl minseqe 2013_ouellette
 
BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features
BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future FeaturesBioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features
BioSQL Reloaded: v1.0 Release, PhyloDB Module, and Future Features
 
covo.js : A JavaScript Library to Utilize Subject Headings and Thesauri on th...
covo.js : A JavaScript Library to Utilize Subject Headings and Thesauri on th...covo.js : A JavaScript Library to Utilize Subject Headings and Thesauri on th...
covo.js : A JavaScript Library to Utilize Subject Headings and Thesauri on th...
 

Dernier

Q-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQ-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQuiz Club NITW
 
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvRicaMaeCastro1
 
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
Unraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptxUnraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptx
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptxDhatriParmar
 
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxGrade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxkarenfajardo43
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
 
How to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseHow to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseCeline George
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Projectjordimapav
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptxmary850239
 
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationdeepaannamalai16
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...DhatriParmar
 
4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptxmary850239
 
MS4 level being good citizen -imperative- (1) (1).pdf
MS4 level   being good citizen -imperative- (1) (1).pdfMS4 level   being good citizen -imperative- (1) (1).pdf
MS4 level being good citizen -imperative- (1) (1).pdfMr Bounab Samir
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfPatidar M
 
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxDIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxMichelleTuguinay1
 
Using Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea DevelopmentUsing Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea Developmentchesterberbo7
 
Scientific Writing :Research Discourse
Scientific  Writing :Research  DiscourseScientific  Writing :Research  Discourse
Scientific Writing :Research DiscourseAnita GoswamiGiri
 

Dernier (20)

Q-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITWQ-Factor General Quiz-7th April 2024, Quiz Club NITW
Q-Factor General Quiz-7th April 2024, Quiz Club NITW
 
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of EngineeringFaculty Profile prashantha K EEE dept Sri Sairam college of Engineering
Faculty Profile prashantha K EEE dept Sri Sairam college of Engineering
 
Mattingly "AI & Prompt Design: Large Language Models"
Mattingly "AI & Prompt Design: Large Language Models"Mattingly "AI & Prompt Design: Large Language Models"
Mattingly "AI & Prompt Design: Large Language Models"
 
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
 
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
Unraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptxUnraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptx
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
 
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptxGrade Three -ELLNA-REVIEWER-ENGLISH.pptx
Grade Three -ELLNA-REVIEWER-ENGLISH.pptx
 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx
 
Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
 
How to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseHow to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 Database
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx
 
Congestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentationCongestive Cardiac Failure..presentation
Congestive Cardiac Failure..presentation
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
 
4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx4.9.24 School Desegregation in Boston.pptx
4.9.24 School Desegregation in Boston.pptx
 
MS4 level being good citizen -imperative- (1) (1).pdf
MS4 level   being good citizen -imperative- (1) (1).pdfMS4 level   being good citizen -imperative- (1) (1).pdf
MS4 level being good citizen -imperative- (1) (1).pdf
 
Active Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdfActive Learning Strategies (in short ALS).pdf
Active Learning Strategies (in short ALS).pdf
 
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxDIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
 
Using Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea DevelopmentUsing Grammatical Signals Suitable to Patterns of Idea Development
Using Grammatical Signals Suitable to Patterns of Idea Development
 
Scientific Writing :Research Discourse
Scientific  Writing :Research  DiscourseScientific  Writing :Research  Discourse
Scientific Writing :Research Discourse
 

Bio4j: A pioneer graph based database for the integration of biological Big Data

  • 1. Bio4j: A pioneer graph based database for the integration of biological Big Data www.ohnosequences.com www.bio4j.com
  • 2. What’s Bio4j? Bio4j is a bioinformatics graph-based DB that includes most of the data available in UniProt (Swiss-Prot + TrEMBL), Gene Ontology (GO), UniRef (50, 90, 100), NCBI Taxonomy, RefSeq, and Enzyme DB.
  • 3. What’s Bio4j? It provides a completely new and powerful framework for querying and managing protein-related information. Since it relies on a high-performance graph engine, data is stored in a way that semantically represents its own structure.
  • 4. What’s Bio4j? Bio4j uses Neo4j technology, a "high-performance graph engine with all the features of a mature and robust database". Because it builds on the Neo4j DB and on the API provided, Bio4j is also very scalable, allowing anyone to easily incorporate their own data and make the most of it.
  • 5. What’s Bio4j? Everything in Bio4j is open source, released under AGPLv3.
  • 6. Highly interconnected, overlapping knowledge is spread throughout different DBs. Sections of the talk: Bioinformatics DBs and Graphs, Initial motivation, Bio4j structure, Some samples, Why Bio4j?, Bio4j and the Cloud, Upcoming features.
  • 7. However, in most cases all this data is modeled in relational databases, and sometimes even just as plain CSV files. As the amount and diversity of data grows, domain models become crazily complicated!
  • 8. With a relational paradigm, the double implication Entity ⇔ Table does not go both ways.
        - You get ‘auxiliary’ tables that have no relationship with the small piece of reality you are modeling.
        - You need ‘artificial’ IDs only for connecting entities (and these are mixed with IDs that somehow live in reality).
        - Entity-relationship models are cool, but in the end you always have to deal with ‘raw’ tables plus SQL.
        - Integrating new knowledge into already existing databases is hard, and sometimes not even possible without changing the domain model.
  • 9. Life in general and biology in particular are probably not 100% like a graph… but one thing’s sure, they are not a set of tables!
  • 10. NoSQL (not only SQL). NoSQ… what!?? Let’s see what Wikipedia says: “NoSQL is a broad class of database management systems that differ from the classic model of the relational database management system (RDBMS) in some significant ways. These data stores may not require fixed table schemas, usually avoid join operations and typically scale horizontally.”
  • 11. NoSQL data models
  • 12. Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together the distributed systems technologies from Dynamo and the data model from Google's BigTable. MongoDB (from "humongous") is an open source document-oriented NoSQL database system written in the C++ programming language.
  • 13. Neo4j is a high-performance, NOSQL graph database with all the features of a mature and robust database.
        - The programmer works with an object-oriented, flexible network structure rather than with strict and static tables.
        - All the benefits of a fully transactional, enterprise-strength database.
        - For many applications, Neo4j offers performance improvements on the order of 1000x or more compared to relational DBs.
  • 14. OK, but why start all this? Were you so bored…?! It all started somehow around our need for massive access to protein GO (Gene Ontology) annotations. At that point I had to develop my own MySQL DB based on the official GO SQL database, and problems started from the beginning:
        - I went crazy ‘deciphering’ how to extract UniProt protein annotations from the official GO table schema.
        - UniProt and GO official protein annotations were not always consistent.
        - Populating my own DB took really long due to all the joins and subqueries needed to get and store the protein annotations.
        Soon enough we also needed massive access to basic protein information.
  • 15. These processes had to be automated for BG7, our bacterial genome annotation system (specifically designed for NGS data). The UniProt web services available were too limited:
        - Slow
        - Limited number of queries
        - Too little information available
        So I downloaded the whole UniProt DB in XML format (Swiss-Prot + TrEMBL) and started to have some fun with it!
  • 16. We got used to having massive direct access to all this protein-related information… So why not add other resources that we needed quite often in most projects, and which were now becoming a sort of bottleneck compared to all those already included in Bio4j? Then came:
        - Isoform sequences
        - Protein interactions and features
        - UniRef 50, 90, and 100
        - RefSeq
        - NCBI Taxonomy
        - Enzyme Expasy DB
  • 17. Let’s dig a bit into the Bio4j structure. Data sources and their relationships:
  • 18. The Graph DB model: representation. Core abstractions:
        - Nodes
        - Relationships between nodes
        - Properties on both
  • 19. Let’s dig a bit into the Bio4j structure. How are things modeled? Couldn’t be simpler! Entities are modeled as nodes; associations / relationships are modeled as edges.
  • 20. Some examples of nodes would be: GO term, Protein, Genome Element. And of relationships: a Protein node and a GO term node linked by PROTEIN_GO_ANNOTATION.
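As a rough illustration of the model on slides 18 to 20, the sketch below builds a tiny in-memory property graph with nodes, typed relationships, and properties on both. It is not the Bio4j or Neo4j API: the class and method names, the "id" property on the GO term, the sample GO identifier, and the direction of the relationship (from protein to GO term) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory property graph; NOT the Bio4j or Neo4j API.
class PropertyGraphSketch {

    // A node carries arbitrary key/value properties and knows its relationships.
    static final class Node {
        final Map<String, Object> properties = new HashMap<>();
        final List<Relationship> outgoing = new ArrayList<>();
        final List<Relationship> incoming = new ArrayList<>();
    }

    // A relationship has a type, a start node, an end node, and its own properties.
    static final class Relationship {
        final String type;
        final Node start;
        final Node end;
        final Map<String, Object> properties = new HashMap<>();

        Relationship(String type, Node start, Node end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Creates a node with a single initial property.
    static Node node(String key, Object value) {
        Node n = new Node();
        n.properties.put(key, value);
        return n;
    }

    // Links two nodes and registers the relationship on both ends.
    static Relationship connect(Node start, String type, Node end) {
        Relationship r = new Relationship(type, start, end);
        start.outgoing.add(r);
        end.incoming.add(r);
        return r;
    }

    // Collects the "id" property of every GO term annotated on the given protein.
    static List<Object> goTermIds(Node protein) {
        List<Object> ids = new ArrayList<>();
        for (Relationship r : protein.outgoing) {
            if (r.type.equals("PROTEIN_GO_ANNOTATION")) {
                ids.add(r.end.properties.get("id"));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        Node protein = node("accession", "P12345");
        Node goTerm = node("id", "GO:0005739"); // sample value chosen for illustration
        connect(protein, "PROTEIN_GO_ANNOTATION", goTerm);
        System.out.println(goTermIds(protein));
    }
}
```

The point of the sketch is that traversal follows references stored on the node itself, so reaching a protein's annotations needs no join table and no artificial connecting IDs.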
  • 21. We have developed a tool intended to serve both as a reference manual and as a first contact with the Bio4j domain model: Bio4jExplorer. Bio4jExplorer allows you to:
        - Navigate through all nodes and relationships
        - Access the javadocs of any node or relationship
        - Graphically explore the neighborhood of a node/relationship
        - Look up the indexes that may serve as an entry point for a node
        - Check incoming/outgoing relationships of a specific node
        - Check start/end nodes of a specific relationship
  • 22. Entry points and indexing. There are two kinds of entry points for the graph:
        Auxiliary relationships going from the reference node, e.g.
        - CELLULAR_COMPONENT: leads to the root of the GO cellular component sub-ontology
        - MAIN_DATASET: leads to both main datasets, Swiss-Prot and TrEMBL
        Node indexing. There are two types of node indexes:
        - Exact: only exact values are considered hits
        - Fulltext: regular expressions can be used
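The difference between the two index types on slide 22 can be pictured with a toy lookup table. This is only a sketch under stated assumptions: the class `IndexLookupSketch`, its methods, and the numeric node ids are hypothetical, and a real index is maintained by the database engine rather than by a linear scan over a map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Toy stand-in for a node index; names and behavior are illustrative only.
class IndexLookupSketch {

    // Indexed value -> node id, kept in insertion order.
    private final Map<String, Long> entries = new LinkedHashMap<>();

    void add(String value, long nodeId) {
        entries.put(value, nodeId);
    }

    // Exact lookup: a hit only when the whole value matches.
    Long exact(String value) {
        return entries.get(value);
    }

    // Pattern lookup: every entry whose value contains a match for the regex.
    List<Long> matching(String regex) {
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        List<Long> hits = new ArrayList<>();
        for (Map.Entry<String, Long> e : entries.entrySet()) {
            if (pattern.matcher(e.getKey()).find()) {
                hits.add(e.getValue());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        IndexLookupSketch names = new IndexLookupSketch();
        names.add("Aspartate aminotransferase, mitochondrial", 1L);
        names.add("Aspartate aminotransferase, cytoplasmic", 2L);
        System.out.println(names.exact("Aspartate aminotransferase")); // partial value: no hit
        System.out.println(names.matching("aspartate.*mito"));
    }
}
```

An exact index suits identifiers such as accessions, where the full value is known up front; a pattern-capable index suits names and descriptions, where only a fragment is known.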
  • 23. Retrieving protein info (Bio4jModel Java API)
        //--creating manager and node retriever----
        Bio4jManager manager = new Bio4jManager("/mybio4jdb");
        NodeRetriever nR = new NodeRetriever(manager);
        ProteinNode protein = nR.getProteinNodeByAccession("P12345");
        //--getting more related info----
        List<InterproNode> interpros = protein.getInterpro();
        OrganismNode organism = protein.getOrganism();
        List<GoTermNode> goAnnotations = protein.getGOAnnotations();
        List<ArticleNode> articles = protein.getArticleCitations();
        for (ArticleNode article : articles) {
            System.out.println(article.getPubmedId());
        }
        //And don't forget to close the Bio4jManager
        manager.shutDown();
  • 24. Proteins with Interpro motif ‘IPR000847’ (Bio4jModel Java API)
        //--creating manager and node retriever----
        Bio4jManager manager = new Bio4jManager("/mybio4jdb");
        NodeRetriever nR = new NodeRetriever(manager);
        InterproNode interpro = nR.getInterproById("IPR000847");
        ProteinInterproRel rel = new ProteinInterproRel(null);
        //--getRelationships returns an Iterable, so ask it for an iterator----
        Iterator<Relationship> iterator = interpro.getNode().getRelationships(rel, Direction.INCOMING).iterator();
        while (iterator.hasNext()) {
            ProteinNode p = new ProteinNode(iterator.next().getStartNode());
            System.out.println(p.getAccession());
        }
        //And don't forget to close the Bio4jManager
        manager.shutDown();
  • 25. Querying Bio4j with Cypher
        Getting a keyword by its ID:
        START k=node:keyword_id_index(keyword_id_index = "KW-0181")
        return k.name, k.id
        Finding circuits/simple cycles of length 3 where at least one protein is from the Swiss-Prot dataset:
        START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
        MATCH d <-[r:PROTEIN_DATASET]- p,
              circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2) -[:PROTEIN_PROTEIN_INTERACTION]-> (p3) -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
        return p.accession, p2.accession, p3.accession
        Check this blog post for more info and our Bio4j Cypher cheat sheet.
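To make the second Cypher query easier to follow, here is a plain Java sketch of the same idea on an in-memory adjacency map: start from each protein known to belong to Swiss-Prot, follow three directed interaction edges, and keep the paths that return to the start. The class `TriangleSketch`, the map-based graph, and the accessions A to D are placeholders, and requiring three distinct proteins is a choice made for this sketch that may not match how Cypher treats repeated nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: finds directed cycles p -> p2 -> p3 -> p in an adjacency map.
class TriangleSketch {

    static List<List<String>> cycles(Map<String, Set<String>> interactions,
                                     Set<String> swissProt) {
        List<List<String>> found = new ArrayList<>();
        for (String p : swissProt) {                       // start only from Swiss-Prot proteins
            for (String p2 : interactions.getOrDefault(p, Set.of())) {
                if (p2.equals(p)) {
                    continue;                              // skip self-interactions
                }
                for (String p3 : interactions.getOrDefault(p2, Set.of())) {
                    if (p3.equals(p) || p3.equals(p2)) {
                        continue;                          // keep the three proteins distinct
                    }
                    // The cycle closes when p3 interacts with the starting protein.
                    if (interactions.getOrDefault(p3, Set.of()).contains(p)) {
                        found.add(List.of(p, p2, p3));
                    }
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> interactions = Map.of(
                "A", Set.of("B"),
                "B", Set.of("C", "D"),
                "C", Set.of("A"));
        System.out.println(cycles(interactions, Set.of("A")));
    }
}
```

As in the query, a cycle whose three proteins are all in Swiss-Prot is reported once per starting protein, so the same triangle can appear as up to three rotations.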
  • 26. Get protein by its accession number and return its full name:
        gremlin> g.idx('protein_accession_index')[['protein_accession_index':'P12345']].full_name
        ==> Aspartate aminotransferase, mitochondrial
        Get proteins (accessions) associated to an Interpro motif (limited to 4 results):
        gremlin> g.idx('interpro_id_index')[['interpro_id_index':'IPR023306']].inE('PROTEIN_INTERPRO').outV.accession[0..3]
        ==> E2GK26
        ==> G3PMS4
        ==> G3Q865
        ==> G3PIL8
        Check our Bio4j Gremlin cheat sheet.
  • 27. REST Server. You can also query/navigate through Bio4j with the REST API! The default representation is JSON, both for responses and for data sent with POST/PUT requests.
        Get protein by its accession number (Q9UR66):
        http://server_url:7474/db/data/index/node/protein_accession_index/protein_accession_index/Q9UR66
        Get outgoing relationships for protein Q9UR66:
        http://server_url:7474/db/data/node/Q9UR66_node_id/relationships/out
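The two URL shapes on this slide can be assembled programmatically. The helper below is a minimal sketch that only builds the URIs and sends no request; the class `RestUrlSketch`, its method names, the base address http://localhost:7474, and the node id 42 are assumptions for illustration, while the path layout is copied from the slide.

```java
import java.net.URI;

// Builds the two kinds of REST URLs shown on the slide; no request is sent.
class RestUrlSketch {

    // Index lookup: /db/data/index/node/{index}/{key}/{value}
    static URI indexLookup(String base, String index, String key, String value) {
        return URI.create(base + "/db/data/index/node/" + index + "/" + key + "/" + value);
    }

    // Outgoing relationships of a node: /db/data/node/{id}/relationships/out
    static URI outgoingRelationships(String base, long nodeId) {
        return URI.create(base + "/db/data/node/" + nodeId + "/relationships/out");
    }

    public static void main(String[] args) {
        String base = "http://localhost:7474"; // assumed server address
        System.out.println(indexLookup(base, "protein_accession_index",
                "protein_accession_index", "Q9UR66"));
        System.out.println(outgoingRelationships(base, 42L)); // 42 is a made-up node id
    }
}
```

The node id in the second URL is the database's internal id, which would normally be read from the response to the index lookup. Values containing spaces or other reserved characters would need URL encoding before being concatenated this way.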
  • 28. Visualizations (1): REST Server Data Browser. Navigate through Bio4j data in real time!
  • 29. Visualizations (2): Bio4j + Gephi. Get really cool graph visualizations using Bio4j and the Gephi visualization and exploration platform.
  • 30. Visualizations (3): Bio4j GO Tools
  • 31. Why would I use Bio4j?
        - Massive access to protein/genome/taxonomy… related information
        - Integration of your own DBs/resources around common information
        - Development of services tailored to your needs built around Bio4j
        - Network analysis
        - Visualizations
        - Besides many others I cannot think of myself… If you have something in mind for which Bio4j might be useful, please let us know so we can all see how it could help you meet your needs! ;)
  • 32. Bio4j + Cloud (1). We use AWS (Amazon Web Services) everywhere we can around Bio4j, which gives us the following benefits:
        Interoperability and data distribution: releases are available as public EBS snapshots, so AWS users can create 100% ready Bio4j DB volumes and attach them to their instances in just a few seconds.
        CloudFormation templates:
        - Basic Bio4j DB Instance
        - Bio4j REST Server Instance
  • 33. Bio4j + Cloud (2). Backup and storage using S3 (Simple Storage Service). We use S3 both for backup (indirectly, through the EBS snapshots) and for storage (directly, storing RefSeq sequences as independent S3 files). What kind of benefits do we get from this?
        - Easy to use
        - Flexible
        - Cost-effective
        - Reliable
        - Scalable and high-performance
        - Secure
  • 34. Bio4j + Cloud (3). Web servers and service providers in the cloud. Deploying your own web server in AWS using Bio4j as back-end is really simple. A good example of this would be Bio4jTestServer, a continuously developed server showcasing Web Services based on Bio4j.
  • 35. Upcoming features
        - Relationship indexing for relationships going to and coming from supernodes. No one’s perfect, and Bio4j is no exception. Relationship fetching can become a bottleneck whenever you have to deal with supernodes (unless you index these relationships). Fortunately this is something that Neo4j is going to fix in the next version(s).
        - More resources available (Reactome…)
        - Improvements in the importing process
        - A more complete version of Bio4jModel, allowing users to perform almost all sorts of queries without having to worry about the Neo4j core API.
        - New tools, services and visualizations built around Bio4j
  • 36. Community. Bio4j has a fast-growing internet presence:
        - Twitter: check @bio4j for updates
        - Blog: go to http://blog.bio4j.com
        - Mail-list: ask any question you may have in our list.
        - LinkedIn: check the Bio4j group
        - Github issues: don’t be shy! Open a new issue if you think something’s going wrong.
  • 37. and... Who’s behind all this? Bio4j is being developed by members of the Oh no sequences! team and Era7 Bioinformatics:
        - Pablo Pareja Tobes: Main developer (that’s me!)
        - Eduardo Pareja Tobes: Technology and architecture main advisor
        - Raquel Tobes: Bioinformatics main advisor
        - Marina Manrique: Bioinformatics support
        - Eduardo Pareja: Scientific advisor
  • 38. That’s it! Thanks for your time ;)