SlideShare une entreprise Scribd logo
1  sur  13
The new EBI search engine: EB-eye Backend : An overview of what is under the hood Industry Workshop 21-22 May, 2007 Franck Valentin – External Services group
Summary ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
What is the data available ? Ligand > 20 domains >137M entries > 550 Gb of data
What is the data available – formats Ligand <XML> . . . </XML> <XML> . . . </XML> <XML> . . . </XML> <XML> . . . </XML> <XML> . . . </XML> <XML> . . . </XML> <XML> . . . </XML> ID  : ..  PARENT ID : .. RANK  : .. ... ID ... AC ... DT ... ID ... AC ... DT ...
What is the data available – sizes 43M 4.2G 57Gb, >500 files   1G 8.4G 374Gb, >600 files   6.3G 25K 81M
Points to take into consideration ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Parsing and indexing different formats Indexer Lucene API Db EMBL grammar Taxonomy grammar UniProt grammar . . . Parser (ANTXR) Medline grammar InterPro grammar Dump file grammar . . . Parser (ANTLR) Uniprot Index Embl Index Taxonomy Index ID  AF030562; SV 1; linear; genomic DNA; STS; FUN; 852 BP. AC  AF030562; DT  04-DEC-1997 (Rel. 53, Created) DT  03-MAR-2000 (Rel. 62, Last updated, Version 2) XX DE  Fusarium venenatum clone VEN-A RAPD band generated using Operon primer DE  OPW-03, sequence tagged site. . . . Flat files <MedlineCitationSet> <MedlineCitation Owner=&quot;NLM&quot; Status=&quot;MEDLINE&quot;> <PMID>10997935</PMID> <DateCreated> <Year>2000</Year> <Month>10</Month> <Day>04</Day> </DateCreated> … XML files <MedlineCitationSet> <MedlineCitation Owner=&quot;NLM&quot; Status=&quot;MEDLINE&quot;> <PMID> 14216186 </PMID> <DateCreated> <Year> 1965 </Year> <Month> 02 </Month> <Day> 01 </Day> </DateCreated> <DateCompleted> <Year> 1996 </Year> <Month> 12 </Month> <Day> 01 </Day> </DateCompleted> <DateRevised> <Year>2007</Year> <Month>03</Month> <Day>01</Day> </DateRevised> <Article PubModel=&quot;Print&quot;> <Journal> <ISSN IssnType=&quot;Print&quot;> 0009-8981 </ISSN> <JournalIssue CitedMedium=&quot;Print&quot;> <Volume> 10 </Volume> <PubDate> <Year>1964</Year> <Month>Jul</Month> </PubDate> </JournalIssue> <Title> Clinica chimica acta; international journal of clinical chemistry </Title> <ISOAbbreviation>Clin. Chim. Acta</ISOAbbreviation> </Journal> . . . . . . ID Creation Date Modification Date issn volume name ID  AF030562 ; SV 1; linear; genomic DNA; STS; FUN; 852 BP. XX AC  AF030562 ; XX DT  04-DEC-1997  (Rel. 53, Created) DT  03-MAR-2000  (Rel. 62, Last updated, Version 2) XX DE  Fusarium venenatum clone VEN-A RAPD band generated using Operon primer DE  OPW-03, sequence tagged site . XX KW  STS. XX OS  Fusarium venenatum OC  Eukaryota; Fungi; Ascomycota; Pezizomycotina; Sordariomycetes ; OC  Hypocreomycetidae; Hypocreales; mitosporic Hypocreales; Fusarium . XX RN  [1] RP  1-852 RA  Yoder W.T., Christianson L.M .; RT  &quot;Species-specific primers resolve members of the section Fusarium . RT  Taxonomic status of the edible 'Quorn' fungus re-evaluated &quot;; RL  Fungal Genet. Biol. 0:0-0(1997). XX RN  [2] RP  1-852 RA  Yoder W.T., Christianson L.M.; RT  ; RL  Submitted (21-OCT-1997) to the EMBL/GenBank/DDBJ databases . RL  Microbiology, Novo Nordisk Biotech, Inc., 1445 Drew Ave., Davis, CA 95616 , RL  USA XX FH  Key  Location/Qualifiers FH FT  source  1..852 FT  /organism=&quot;Fusarium venenatum&quot; FT  /strain=&quot;ATCC20334“ FT  /db_xref=&quot;taxon:56646&quot; . . .  ID AC Creation date / Modification date Description Organism species Organism classes References References <database> <name>IntAct.Experiment</name> <description>Experimental procedures that allowed to…</description> <release>1.0</release> <release_date>2007-Feb-16</release_date> <entry_count>5697</entry_count> <entries> <entry id=&quot;EBI-77680&quot;> … Dump file (XML)
Divide and Conquer the Indexing UniProt (>4M entries)   Embl (>83M entries) 2 files,  ~ 9.4G >600 files ~ 375G Medline (>16M entries) >500 files ~ 57G Taxonomy (>0.37M entries)   1 file,  ~ 81M GO (>0.23M entries) 1 file ~ 27M Others (ArrayExpress Ensembl, Intact, …) XML XML XML dump XML dump XML dump 8 cpu 8 cpu 8 cpu 8 cpu XML XML XML dump XML dump XML dump Embl Index Uniprot Index Embl Index Taxonomy Index Medline Index ArrayExpress Index Ensembl Index Intact Index XML XML XML dump XML dump XML dump Db
Let’s put some figures on it Less than 18 hours to index all the EBI
Web side story UniProt Index Embl Index Taxonomy Index Medline Index ArrayExpress Index Ensembl Index Intact Index Load balancer Tomcat 1 Tomcat 2 Tomcat 3 Tomcat 4
Being up to date ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Libraries ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Acknowledgements ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Contenu connexe

En vedette

Tag yourlife(nfclab9月定例会発表資料)
Tag yourlife(nfclab9月定例会発表資料)Tag yourlife(nfclab9月定例会発表資料)
Tag yourlife(nfclab9月定例会発表資料)Yasuhiro Ohsaka
 
回想支援ツールNFC仏壇
回想支援ツールNFC仏壇回想支援ツールNFC仏壇
回想支援ツールNFC仏壇Yasuhiro Ohsaka
 
Freedom, Money, Time and the Key to Creative Success
Freedom, Money, Time and the Key to Creative SuccessFreedom, Money, Time and the Key to Creative Success
Freedom, Money, Time and the Key to Creative SuccessPaul Smith
 
Dynamic Logic AdReaction 2009 - What Marketers Should Know About Who’s Gettin...
Dynamic Logic AdReaction 2009 - What Marketers Should Know About Who’s Gettin...Dynamic Logic AdReaction 2009 - What Marketers Should Know About Who’s Gettin...
Dynamic Logic AdReaction 2009 - What Marketers Should Know About Who’s Gettin...Kantar
 
Topshop Power Point Ah, Gl, Jb.
Topshop Power Point   Ah, Gl, Jb.Topshop Power Point   Ah, Gl, Jb.
Topshop Power Point Ah, Gl, Jb.Marcus9000
 
How to do business in the Indian Market for Kiko Milano.
How to do business in the Indian Market for Kiko Milano.How to do business in the Indian Market for Kiko Milano.
How to do business in the Indian Market for Kiko Milano.Giacomo Caleffi
 

En vedette (6)

Tag yourlife(nfclab9月定例会発表資料)
Tag yourlife(nfclab9月定例会発表資料)Tag yourlife(nfclab9月定例会発表資料)
Tag yourlife(nfclab9月定例会発表資料)
 
回想支援ツールNFC仏壇
回想支援ツールNFC仏壇回想支援ツールNFC仏壇
回想支援ツールNFC仏壇
 
Freedom, Money, Time and the Key to Creative Success
Freedom, Money, Time and the Key to Creative SuccessFreedom, Money, Time and the Key to Creative Success
Freedom, Money, Time and the Key to Creative Success
 
Dynamic Logic AdReaction 2009 - What Marketers Should Know About Who’s Gettin...
Dynamic Logic AdReaction 2009 - What Marketers Should Know About Who’s Gettin...Dynamic Logic AdReaction 2009 - What Marketers Should Know About Who’s Gettin...
Dynamic Logic AdReaction 2009 - What Marketers Should Know About Who’s Gettin...
 
Topshop Power Point Ah, Gl, Jb.
Topshop Power Point   Ah, Gl, Jb.Topshop Power Point   Ah, Gl, Jb.
Topshop Power Point Ah, Gl, Jb.
 
How to do business in the Indian Market for Kiko Milano.
How to do business in the Indian Market for Kiko Milano.How to do business in the Indian Market for Kiko Milano.
How to do business in the Indian Market for Kiko Milano.
 

Similaire à EB-eye Backend: An Overview of the Indexing and Search Capabilities Behind EBI's New Search Engine

Role of bioinformatics in life sciences research
Role of bioinformatics in life sciences researchRole of bioinformatics in life sciences research
Role of bioinformatics in life sciences researchAnshika Bansal
 
20110114 Next Generation Sequencing Course
20110114 Next Generation Sequencing Course20110114 Next Generation Sequencing Course
20110114 Next Generation Sequencing CoursePierre Lindenbaum
 
Make your data great again - Ver 2
Make your data great again - Ver 2Make your data great again - Ver 2
Make your data great again - Ver 2Daniel JACOB
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012Dan Gaston
 
Bioinfomatics Presentation
Bioinfomatics PresentationBioinfomatics Presentation
Bioinfomatics PresentationZhenhong Bao
 
Biomart Update
Biomart UpdateBiomart Update
Biomart Updatebosc
 
Standarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata filesStandarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata filesYasset Perez-Riverol
 
20110725 ibc xml
20110725 ibc xml20110725 ibc xml
20110725 ibc xmlagosti
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Li Shen
 
A Genome Sequence Analysis System Built with Hypertable
A Genome Sequence Analysis System Built with HypertableA Genome Sequence Analysis System Built with Hypertable
A Genome Sequence Analysis System Built with HypertableDATAVERSITY
 

Similaire à EB-eye Backend: An Overview of the Indexing and Search Capabilities Behind EBI's New Search Engine (20)

Biological databases
Biological databasesBiological databases
Biological databases
 
Role of bioinformatics in life sciences research
Role of bioinformatics in life sciences researchRole of bioinformatics in life sciences research
Role of bioinformatics in life sciences research
 
Gen bank
Gen bankGen bank
Gen bank
 
2016 02 23_biological_databases_part1
2016 02 23_biological_databases_part12016 02 23_biological_databases_part1
2016 02 23_biological_databases_part1
 
Bio2RDF@BH2010
Bio2RDF@BH2010Bio2RDF@BH2010
Bio2RDF@BH2010
 
20110114 Next Generation Sequencing Course
20110114 Next Generation Sequencing Course20110114 Next Generation Sequencing Course
20110114 Next Generation Sequencing Course
 
SAFE EDBT 2011
SAFE EDBT 2011SAFE EDBT 2011
SAFE EDBT 2011
 
NCBI
NCBINCBI
NCBI
 
Make your data great again - Ver 2
Make your data great again - Ver 2Make your data great again - Ver 2
Make your data great again - Ver 2
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012
 
Bioinfomatics Presentation
Bioinfomatics PresentationBioinfomatics Presentation
Bioinfomatics Presentation
 
Biomart Update
Biomart UpdateBiomart Update
Biomart Update
 
20120423.NGS.Rennes
20120423.NGS.Rennes20120423.NGS.Rennes
20120423.NGS.Rennes
 
Odp
OdpOdp
Odp
 
Standarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata filesStandarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata files
 
20110725 ibc xml
20110725 ibc xml20110725 ibc xml
20110725 ibc xml
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
A Genome Sequence Analysis System Built with Hypertable
A Genome Sequence Analysis System Built with HypertableA Genome Sequence Analysis System Built with Hypertable
A Genome Sequence Analysis System Built with Hypertable
 
Major biological nucleotide databases
Major biological nucleotide databasesMajor biological nucleotide databases
Major biological nucleotide databases
 
Gen bank (genetic sequence databank)
Gen bank (genetic sequence databank)Gen bank (genetic sequence databank)
Gen bank (genetic sequence databank)
 

Dernier

Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 

Dernier (20)

Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 

EB-eye Backend: An Overview of the Indexing and Search Capabilities Behind EBI's New Search Engine

  • 1. The new EBI search engine: EB-eye Backend : An overview of what is under the hood Industry Workshop 21-22 May, 2007 Franck Valentin – External Services group
  • 2.
  • 3. What is the data available ? Ligand > 20 domains >137M entries > 550 Gb of data
  • 4. What is the data available – formats Ligand <XML> . . . </XML> <XML> . . . </XML> <XML> . . . </XML> <XML> . . . </XML> <XML> . . . </XML> <XML> . . . </XML> <XML> . . . </XML> ID : .. PARENT ID : .. RANK : .. ... ID ... AC ... DT ... ID ... AC ... DT ...
  • 5. What is the data available – sizes 43M 4.2G 57Gb, >500 files 1G 8.4G 374Gb, >600 files 6.3G 25K 81M
  • 6.
  • 7. Parsing and indexing different formats Indexer Lucene API Db EMBL grammar Taxonomy grammar UniProt grammar . . . Parser (ANTXR) Medline grammar InterPro grammar Dump file grammar . . . Parser (ANTLR) Uniprot Index Embl Index Taxonomy Index ID AF030562; SV 1; linear; genomic DNA; STS; FUN; 852 BP. AC AF030562; DT 04-DEC-1997 (Rel. 53, Created) DT 03-MAR-2000 (Rel. 62, Last updated, Version 2) XX DE Fusarium venenatum clone VEN-A RAPD band generated using Operon primer DE OPW-03, sequence tagged site. . . . Flat files <MedlineCitationSet> <MedlineCitation Owner=&quot;NLM&quot; Status=&quot;MEDLINE&quot;> <PMID>10997935</PMID> <DateCreated> <Year>2000</Year> <Month>10</Month> <Day>04</Day> </DateCreated> … XML files <MedlineCitationSet> <MedlineCitation Owner=&quot;NLM&quot; Status=&quot;MEDLINE&quot;> <PMID> 14216186 </PMID> <DateCreated> <Year> 1965 </Year> <Month> 02 </Month> <Day> 01 </Day> </DateCreated> <DateCompleted> <Year> 1996 </Year> <Month> 12 </Month> <Day> 01 </Day> </DateCompleted> <DateRevised> <Year>2007</Year> <Month>03</Month> <Day>01</Day> </DateRevised> <Article PubModel=&quot;Print&quot;> <Journal> <ISSN IssnType=&quot;Print&quot;> 0009-8981 </ISSN> <JournalIssue CitedMedium=&quot;Print&quot;> <Volume> 10 </Volume> <PubDate> <Year>1964</Year> <Month>Jul</Month> </PubDate> </JournalIssue> <Title> Clinica chimica acta; international journal of clinical chemistry </Title> <ISOAbbreviation>Clin. Chim. Acta</ISOAbbreviation> </Journal> . . . . . . ID Creation Date Modification Date issn volume name ID AF030562 ; SV 1; linear; genomic DNA; STS; FUN; 852 BP. XX AC AF030562 ; XX DT 04-DEC-1997 (Rel. 53, Created) DT 03-MAR-2000 (Rel. 62, Last updated, Version 2) XX DE Fusarium venenatum clone VEN-A RAPD band generated using Operon primer DE OPW-03, sequence tagged site . XX KW STS. XX OS Fusarium venenatum OC Eukaryota; Fungi; Ascomycota; Pezizomycotina; Sordariomycetes ; OC Hypocreomycetidae; Hypocreales; mitosporic Hypocreales; Fusarium . XX RN [1] RP 1-852 RA Yoder W.T., Christianson L.M .; RT &quot;Species-specific primers resolve members of the section Fusarium . RT Taxonomic status of the edible 'Quorn' fungus re-evaluated &quot;; RL Fungal Genet. Biol. 0:0-0(1997). XX RN [2] RP 1-852 RA Yoder W.T., Christianson L.M.; RT ; RL Submitted (21-OCT-1997) to the EMBL/GenBank/DDBJ databases . RL Microbiology, Novo Nordisk Biotech, Inc., 1445 Drew Ave., Davis, CA 95616 , RL USA XX FH Key Location/Qualifiers FH FT source 1..852 FT /organism=&quot;Fusarium venenatum&quot; FT /strain=&quot;ATCC20334“ FT /db_xref=&quot;taxon:56646&quot; . . . ID AC Creation date / Modification date Description Organism species Organism classes References References <database> <name>IntAct.Experiment</name> <description>Experimental procedures that allowed to…</description> <release>1.0</release> <release_date>2007-Feb-16</release_date> <entry_count>5697</entry_count> <entries> <entry id=&quot;EBI-77680&quot;> … Dump file (XML)
  • 8. Divide and Conquer the Indexing UniProt (>4M entries) Embl (>83M entries) 2 files, ~ 9.4G >600 files ~ 375G Medline (>16M entries) >500 files ~ 57G Taxonomy (>0.37M entries) 1 file, ~ 81M GO (>0.23M entries) 1 file ~ 27M Others (ArrayExpress Ensembl, Intact, …) XML XML XML dump XML dump XML dump 8 cpu 8 cpu 8 cpu 8 cpu XML XML XML dump XML dump XML dump Embl Index Uniprot Index Embl Index Taxonomy Index Medline Index ArrayExpress Index Ensembl Index Intact Index XML XML XML dump XML dump XML dump Db
  • 9. Let’s put some figures on it Less than 18 hours to index all the EBI
  • 10. Web side story UniProt Index Embl Index Taxonomy Index Medline Index ArrayExpress Index Ensembl Index Intact Index Load balancer Tomcat 1 Tomcat 2 Tomcat 3 Tomcat 4
  • 11.
  • 12.
  • 13.