SlideShare une entreprise Scribd logo
1  sur  13
The new EBI search engine: EB-eye Backend : An overview of what is under the hood Industry Workshop 21-22 May, 2007 Franck Valentin – External Services group
Summary ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
What is the data available ? Ligand > 20 domains >137M entries > 550 Gb of data
What is the data available – formats Ligand <XML> . . . </XML> <XML> . . . </XML> <XML> . . . </XML> <XML> . . . </XML> <XML> . . . </XML> <XML> . . . </XML> <XML> . . . </XML> ID  : ..  PARENT ID : .. RANK  : .. ... ID ... AC ... DT ... ID ... AC ... DT ...
What is the data available – sizes 43M 4.2G 57Gb, >500 files   1G 8.4G 374Gb, >600 files   6.3G 25K 81M
Points to take into consideration ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Parsing and indexing different formats Indexer Lucene API Db EMBL grammar Taxonomy grammar UniProt grammar . . . Parser (ANTXR) Medline grammar InterPro grammar Dump file grammar . . . Parser (ANTLR) Uniprot Index Embl Index Taxonomy Index ID  AF030562; SV 1; linear; genomic DNA; STS; FUN; 852 BP. AC  AF030562; DT  04-DEC-1997 (Rel. 53, Created) DT  03-MAR-2000 (Rel. 62, Last updated, Version 2) XX DE  Fusarium venenatum clone VEN-A RAPD band generated using Operon primer DE  OPW-03, sequence tagged site. . . . Flat files <MedlineCitationSet> <MedlineCitation Owner=&quot;NLM&quot; Status=&quot;MEDLINE&quot;> <PMID>10997935</PMID> <DateCreated> <Year>2000</Year> <Month>10</Month> <Day>04</Day> </DateCreated> … XML files <MedlineCitationSet> <MedlineCitation Owner=&quot;NLM&quot; Status=&quot;MEDLINE&quot;> <PMID> 14216186 </PMID> <DateCreated> <Year> 1965 </Year> <Month> 02 </Month> <Day> 01 </Day> </DateCreated> <DateCompleted> <Year> 1996 </Year> <Month> 12 </Month> <Day> 01 </Day> </DateCompleted> <DateRevised> <Year>2007</Year> <Month>03</Month> <Day>01</Day> </DateRevised> <Article PubModel=&quot;Print&quot;> <Journal> <ISSN IssnType=&quot;Print&quot;> 0009-8981 </ISSN> <JournalIssue CitedMedium=&quot;Print&quot;> <Volume> 10 </Volume> <PubDate> <Year>1964</Year> <Month>Jul</Month> </PubDate> </JournalIssue> <Title> Clinica chimica acta; international journal of clinical chemistry </Title> <ISOAbbreviation>Clin. Chim. Acta</ISOAbbreviation> </Journal> . . . . . . ID Creation Date Modification Date issn volume name ID  AF030562 ; SV 1; linear; genomic DNA; STS; FUN; 852 BP. XX AC  AF030562 ; XX DT  04-DEC-1997  (Rel. 53, Created) DT  03-MAR-2000  (Rel. 62, Last updated, Version 2) XX DE  Fusarium venenatum clone VEN-A RAPD band generated using Operon primer DE  OPW-03, sequence tagged site . XX KW  STS. XX OS  Fusarium venenatum OC  Eukaryota; Fungi; Ascomycota; Pezizomycotina; Sordariomycetes ; OC  Hypocreomycetidae; Hypocreales; mitosporic Hypocreales; Fusarium . XX RN  [1] RP  1-852 RA  Yoder W.T., Christianson L.M .; RT  &quot;Species-specific primers resolve members of the section Fusarium . RT  Taxonomic status of the edible 'Quorn' fungus re-evaluated &quot;; RL  Fungal Genet. Biol. 0:0-0(1997). XX RN  [2] RP  1-852 RA  Yoder W.T., Christianson L.M.; RT  ; RL  Submitted (21-OCT-1997) to the EMBL/GenBank/DDBJ databases . RL  Microbiology, Novo Nordisk Biotech, Inc., 1445 Drew Ave., Davis, CA 95616 , RL  USA XX FH  Key  Location/Qualifiers FH FT  source  1..852 FT  /organism=&quot;Fusarium venenatum&quot; FT  /strain=&quot;ATCC20334“ FT  /db_xref=&quot;taxon:56646&quot; . . .  ID AC Creation date / Modification date Description Organism species Organism classes References References <database> <name>IntAct.Experiment</name> <description>Experimental procedures that allowed to…</description> <release>1.0</release> <release_date>2007-Feb-16</release_date> <entry_count>5697</entry_count> <entries> <entry id=&quot;EBI-77680&quot;> … Dump file (XML)
Divide and Conquer the Indexing UniProt (>4M entries)   Embl (>83M entries) 2 files,  ~ 9.4G >600 files ~ 375G Medline (>16M entries) >500 files ~ 57G Taxonomy (>0.37M entries)   1 file,  ~ 81M GO (>0.23M entries) 1 file ~ 27M Others (ArrayExpress Ensembl, Intact, …) XML XML XML dump XML dump XML dump 8 cpu 8 cpu 8 cpu 8 cpu XML XML XML dump XML dump XML dump Embl Index Uniprot Index Embl Index Taxonomy Index Medline Index ArrayExpress Index Ensembl Index Intact Index XML XML XML dump XML dump XML dump Db
Let’s put some figures on it Less than 18 hours to index all the EBI
Web side story UniProt Index Embl Index Taxonomy Index Medline Index ArrayExpress Index Ensembl Index Intact Index Load balancer Tomcat 1 Tomcat 2 Tomcat 3 Tomcat 4
Being up to date ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Libraries ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Acknowledgements ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Contenu connexe

En vedette

Tag yourlife(nfclab9月定例会発表資料)
Tag yourlife(nfclab9月定例会発表資料)Tag yourlife(nfclab9月定例会発表資料)
Tag yourlife(nfclab9月定例会発表資料)Yasuhiro Ohsaka
 
回想支援ツールNFC仏壇
回想支援ツールNFC仏壇回想支援ツールNFC仏壇
回想支援ツールNFC仏壇Yasuhiro Ohsaka
 
Freedom, Money, Time and the Key to Creative Success
Freedom, Money, Time and the Key to Creative SuccessFreedom, Money, Time and the Key to Creative Success
Freedom, Money, Time and the Key to Creative SuccessPaul Smith
 
Dynamic Logic AdReaction 2009 - What Marketers Should Know About Who’s Gettin...
Dynamic Logic AdReaction 2009 - What Marketers Should Know About Who’s Gettin...Dynamic Logic AdReaction 2009 - What Marketers Should Know About Who’s Gettin...
Dynamic Logic AdReaction 2009 - What Marketers Should Know About Who’s Gettin...Kantar
 
Topshop Power Point Ah, Gl, Jb.
Topshop Power Point   Ah, Gl, Jb.Topshop Power Point   Ah, Gl, Jb.
Topshop Power Point Ah, Gl, Jb.Marcus9000
 
How to do business in the Indian Market for Kiko Milano.
How to do business in the Indian Market for Kiko Milano.How to do business in the Indian Market for Kiko Milano.
How to do business in the Indian Market for Kiko Milano.Giacomo Caleffi
 

En vedette (6)

Tag yourlife(nfclab9月定例会発表資料)
Tag yourlife(nfclab9月定例会発表資料)Tag yourlife(nfclab9月定例会発表資料)
Tag yourlife(nfclab9月定例会発表資料)
 
回想支援ツールNFC仏壇
回想支援ツールNFC仏壇回想支援ツールNFC仏壇
回想支援ツールNFC仏壇
 
Freedom, Money, Time and the Key to Creative Success
Freedom, Money, Time and the Key to Creative SuccessFreedom, Money, Time and the Key to Creative Success
Freedom, Money, Time and the Key to Creative Success
 
Dynamic Logic AdReaction 2009 - What Marketers Should Know About Who’s Gettin...
Dynamic Logic AdReaction 2009 - What Marketers Should Know About Who’s Gettin...Dynamic Logic AdReaction 2009 - What Marketers Should Know About Who’s Gettin...
Dynamic Logic AdReaction 2009 - What Marketers Should Know About Who’s Gettin...
 
Topshop Power Point Ah, Gl, Jb.
Topshop Power Point   Ah, Gl, Jb.Topshop Power Point   Ah, Gl, Jb.
Topshop Power Point Ah, Gl, Jb.
 
How to do business in the Indian Market for Kiko Milano.
How to do business in the Indian Market for Kiko Milano.How to do business in the Indian Market for Kiko Milano.
How to do business in the Indian Market for Kiko Milano.
 

Similaire à EB-eye Backend: An Overview of the Indexing and Search Capabilities Behind EBI's New Search Engine

Role of bioinformatics in life sciences research
Role of bioinformatics in life sciences researchRole of bioinformatics in life sciences research
Role of bioinformatics in life sciences researchAnshika Bansal
 
20110114 Next Generation Sequencing Course
20110114 Next Generation Sequencing Course20110114 Next Generation Sequencing Course
20110114 Next Generation Sequencing CoursePierre Lindenbaum
 
Make your data great again - Ver 2
Make your data great again - Ver 2Make your data great again - Ver 2
Make your data great again - Ver 2Daniel JACOB
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012Dan Gaston
 
Bioinfomatics Presentation
Bioinfomatics PresentationBioinfomatics Presentation
Bioinfomatics PresentationZhenhong Bao
 
Biomart Update
Biomart UpdateBiomart Update
Biomart Updatebosc
 
Standarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata filesStandarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata filesYasset Perez-Riverol
 
20110725 ibc xml
20110725 ibc xml20110725 ibc xml
20110725 ibc xmlagosti
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Li Shen
 
A Genome Sequence Analysis System Built with Hypertable
A Genome Sequence Analysis System Built with HypertableA Genome Sequence Analysis System Built with Hypertable
A Genome Sequence Analysis System Built with HypertableDATAVERSITY
 

Similaire à EB-eye Backend: An Overview of the Indexing and Search Capabilities Behind EBI's New Search Engine (20)

Biological databases
Biological databasesBiological databases
Biological databases
 
Role of bioinformatics in life sciences research
Role of bioinformatics in life sciences researchRole of bioinformatics in life sciences research
Role of bioinformatics in life sciences research
 
Gen bank
Gen bankGen bank
Gen bank
 
2016 02 23_biological_databases_part1
2016 02 23_biological_databases_part12016 02 23_biological_databases_part1
2016 02 23_biological_databases_part1
 
Bio2RDF@BH2010
Bio2RDF@BH2010Bio2RDF@BH2010
Bio2RDF@BH2010
 
20110114 Next Generation Sequencing Course
20110114 Next Generation Sequencing Course20110114 Next Generation Sequencing Course
20110114 Next Generation Sequencing Course
 
SAFE EDBT 2011
SAFE EDBT 2011SAFE EDBT 2011
SAFE EDBT 2011
 
NCBI
NCBINCBI
NCBI
 
Make your data great again - Ver 2
Make your data great again - Ver 2Make your data great again - Ver 2
Make your data great again - Ver 2
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012
 
Bioinfomatics Presentation
Bioinfomatics PresentationBioinfomatics Presentation
Bioinfomatics Presentation
 
Biomart Update
Biomart UpdateBiomart Update
Biomart Update
 
20120423.NGS.Rennes
20120423.NGS.Rennes20120423.NGS.Rennes
20120423.NGS.Rennes
 
Odp
OdpOdp
Odp
 
Standarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata filesStandarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata files
 
20110725 ibc xml
20110725 ibc xml20110725 ibc xml
20110725 ibc xml
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
A Genome Sequence Analysis System Built with Hypertable
A Genome Sequence Analysis System Built with HypertableA Genome Sequence Analysis System Built with Hypertable
A Genome Sequence Analysis System Built with Hypertable
 
Major biological nucleotide databases
Major biological nucleotide databasesMajor biological nucleotide databases
Major biological nucleotide databases
 
Gen bank (genetic sequence databank)
Gen bank (genetic sequence databank)Gen bank (genetic sequence databank)
Gen bank (genetic sequence databank)
 

Dernier

Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 

Dernier (20)

The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

EB-eye Backend: An Overview of the Indexing and Search Capabilities Behind EBI's New Search Engine

  • 1. The new EBI search engine: EB-eye Backend : An overview of what is under the hood Industry Workshop 21-22 May, 2007 Franck Valentin – External Services group
  • 2.
  • 3. What is the data available ? Ligand > 20 domains >137M entries > 550 Gb of data
  • 4. What is the data available – formats Ligand <XML> . . . </XML> <XML> . . . </XML> <XML> . . . </XML> <XML> . . . </XML> <XML> . . . </XML> <XML> . . . </XML> <XML> . . . </XML> ID : .. PARENT ID : .. RANK : .. ... ID ... AC ... DT ... ID ... AC ... DT ...
  • 5. What is the data available – sizes 43M 4.2G 57Gb, >500 files 1G 8.4G 374Gb, >600 files 6.3G 25K 81M
  • 6.
  • 7. Parsing and indexing different formats Indexer Lucene API Db EMBL grammar Taxonomy grammar UniProt grammar . . . Parser (ANTXR) Medline grammar InterPro grammar Dump file grammar . . . Parser (ANTLR) Uniprot Index Embl Index Taxonomy Index ID AF030562; SV 1; linear; genomic DNA; STS; FUN; 852 BP. AC AF030562; DT 04-DEC-1997 (Rel. 53, Created) DT 03-MAR-2000 (Rel. 62, Last updated, Version 2) XX DE Fusarium venenatum clone VEN-A RAPD band generated using Operon primer DE OPW-03, sequence tagged site. . . . Flat files <MedlineCitationSet> <MedlineCitation Owner=&quot;NLM&quot; Status=&quot;MEDLINE&quot;> <PMID>10997935</PMID> <DateCreated> <Year>2000</Year> <Month>10</Month> <Day>04</Day> </DateCreated> … XML files <MedlineCitationSet> <MedlineCitation Owner=&quot;NLM&quot; Status=&quot;MEDLINE&quot;> <PMID> 14216186 </PMID> <DateCreated> <Year> 1965 </Year> <Month> 02 </Month> <Day> 01 </Day> </DateCreated> <DateCompleted> <Year> 1996 </Year> <Month> 12 </Month> <Day> 01 </Day> </DateCompleted> <DateRevised> <Year>2007</Year> <Month>03</Month> <Day>01</Day> </DateRevised> <Article PubModel=&quot;Print&quot;> <Journal> <ISSN IssnType=&quot;Print&quot;> 0009-8981 </ISSN> <JournalIssue CitedMedium=&quot;Print&quot;> <Volume> 10 </Volume> <PubDate> <Year>1964</Year> <Month>Jul</Month> </PubDate> </JournalIssue> <Title> Clinica chimica acta; international journal of clinical chemistry </Title> <ISOAbbreviation>Clin. Chim. Acta</ISOAbbreviation> </Journal> . . . . . . ID Creation Date Modification Date issn volume name ID AF030562 ; SV 1; linear; genomic DNA; STS; FUN; 852 BP. XX AC AF030562 ; XX DT 04-DEC-1997 (Rel. 53, Created) DT 03-MAR-2000 (Rel. 62, Last updated, Version 2) XX DE Fusarium venenatum clone VEN-A RAPD band generated using Operon primer DE OPW-03, sequence tagged site . XX KW STS. XX OS Fusarium venenatum OC Eukaryota; Fungi; Ascomycota; Pezizomycotina; Sordariomycetes ; OC Hypocreomycetidae; Hypocreales; mitosporic Hypocreales; Fusarium . XX RN [1] RP 1-852 RA Yoder W.T., Christianson L.M .; RT &quot;Species-specific primers resolve members of the section Fusarium . RT Taxonomic status of the edible 'Quorn' fungus re-evaluated &quot;; RL Fungal Genet. Biol. 0:0-0(1997). XX RN [2] RP 1-852 RA Yoder W.T., Christianson L.M.; RT ; RL Submitted (21-OCT-1997) to the EMBL/GenBank/DDBJ databases . RL Microbiology, Novo Nordisk Biotech, Inc., 1445 Drew Ave., Davis, CA 95616 , RL USA XX FH Key Location/Qualifiers FH FT source 1..852 FT /organism=&quot;Fusarium venenatum&quot; FT /strain=&quot;ATCC20334“ FT /db_xref=&quot;taxon:56646&quot; . . . ID AC Creation date / Modification date Description Organism species Organism classes References References <database> <name>IntAct.Experiment</name> <description>Experimental procedures that allowed to…</description> <release>1.0</release> <release_date>2007-Feb-16</release_date> <entry_count>5697</entry_count> <entries> <entry id=&quot;EBI-77680&quot;> … Dump file (XML)
  • 8. Divide and Conquer the Indexing UniProt (>4M entries) Embl (>83M entries) 2 files, ~ 9.4G >600 files ~ 375G Medline (>16M entries) >500 files ~ 57G Taxonomy (>0.37M entries) 1 file, ~ 81M GO (>0.23M entries) 1 file ~ 27M Others (ArrayExpress Ensembl, Intact, …) XML XML XML dump XML dump XML dump 8 cpu 8 cpu 8 cpu 8 cpu XML XML XML dump XML dump XML dump Embl Index Uniprot Index Embl Index Taxonomy Index Medline Index ArrayExpress Index Ensembl Index Intact Index XML XML XML dump XML dump XML dump Db
  • 9. Let’s put some figures on it Less than 18 hours to index all the EBI
  • 10. Web side story UniProt Index Embl Index Taxonomy Index Medline Index ArrayExpress Index Ensembl Index Intact Index Load balancer Tomcat 1 Tomcat 2 Tomcat 3 Tomcat 4
  • 11.
  • 12.
  • 13.