The document summarizes the backend systems and processes that power the new EBI search engine EB-eye. It describes the large amounts and various formats of data being indexed, the parsing and indexing of different data formats using various tools, and the distributed indexing approach across multiple servers that allows indexing to be completed in under 18 hours. It also provides an overview of the web frontend and load balancing, as well as future plans for automatic updates and verifications.
EB-eye Backend: An Overview of the Indexing and Search Capabilities Behind EBI's New Search Engine
1. The new EBI search engine: EB-eye. Backend: an overview of what is under the hood. Industry Workshop, 21-22 May 2007. Franck Valentin, External Services group.
2.
3. What data is available? More than 20 domains, more than 137M entries, more than 550 GB of data. [Diagram label: Ligand]
4. What data is available: formats. [Diagram: several sources delivered as XML (<XML> . . . </XML>); flat files with tagged fields (ID : .., PARENT ID : .., RANK : .., ...); flat files with two-letter line codes (ID ..., AC ..., DT ...). Diagram label: Ligand]
5. What data is available: sizes. [Diagram: per-source data sizes; the source labels are missing from this transcript.] 43M; 4.2G; 57 GB in >500 files; 1G; 8.4G; 374 GB in >600 files; 6.3G; 25K; 81M.
6.
7. Parsing and indexing different formats

[Diagram components: grammars (EMBL grammar, Taxonomy grammar, UniProt grammar, . . . ; Medline grammar, InterPro grammar, Dump file grammar, . . .); Parser (ANTLR) and Parser (ANTXR); Indexer using the Lucene API; Db; resulting indexes (Uniprot Index, Embl Index, Taxonomy Index). Inputs: flat files, XML files, dump files (XML).]

XML files (sample shown in the diagram):

    <MedlineCitationSet>
      <MedlineCitation Owner="NLM" Status="MEDLINE">
        <PMID>10997935</PMID>
        <DateCreated> <Year>2000</Year> <Month>10</Month> <Day>04</Day> </DateCreated>
        …

XML files (detailed sample). Fields highlighted on the slide: ID, Creation Date, Modification Date, issn, volume, name.

    <MedlineCitationSet>
      <MedlineCitation Owner="NLM" Status="MEDLINE">
        <PMID> 14216186 </PMID>
        <DateCreated> <Year> 1965 </Year> <Month> 02 </Month> <Day> 01 </Day> </DateCreated>
        <DateCompleted> <Year> 1996 </Year> <Month> 12 </Month> <Day> 01 </Day> </DateCompleted>
        <DateRevised> <Year>2007</Year> <Month>03</Month> <Day>01</Day> </DateRevised>
        <Article PubModel="Print">
          <Journal>
            <ISSN IssnType="Print"> 0009-8981 </ISSN>
            <JournalIssue CitedMedium="Print">
              <Volume> 10 </Volume>
              <PubDate> <Year>1964</Year> <Month>Jul</Month> </PubDate>
            </JournalIssue>
            <Title> Clinica chimica acta; international journal of clinical chemistry </Title>
            <ISOAbbreviation>Clin. Chim. Acta</ISOAbbreviation>
          </Journal>
          . . .

Flat files (detailed sample). Fields highlighted on the slide: ID, AC, Creation date / Modification date, Description, Organism species, Organism classes, References.

    ID   AF030562; SV 1; linear; genomic DNA; STS; FUN; 852 BP.
    XX
    AC   AF030562;
    XX
    DT   04-DEC-1997 (Rel. 53, Created)
    DT   03-MAR-2000 (Rel. 62, Last updated, Version 2)
    XX
    DE   Fusarium venenatum clone VEN-A RAPD band generated using Operon primer
    DE   OPW-03, sequence tagged site.
    XX
    KW   STS.
    XX
    OS   Fusarium venenatum
    OC   Eukaryota; Fungi; Ascomycota; Pezizomycotina; Sordariomycetes;
    OC   Hypocreomycetidae; Hypocreales; mitosporic Hypocreales; Fusarium.
    XX
    RN   [1]
    RP   1-852
    RA   Yoder W.T., Christianson L.M.;
    RT   "Species-specific primers resolve members of the section Fusarium.
    RT   Taxonomic status of the edible 'Quorn' fungus re-evaluated";
    RL   Fungal Genet. Biol. 0:0-0(1997).
    XX
    RN   [2]
    RP   1-852
    RA   Yoder W.T., Christianson L.M.;
    RT   ;
    RL   Submitted (21-OCT-1997) to the EMBL/GenBank/DDBJ databases.
    RL   Microbiology, Novo Nordisk Biotech, Inc., 1445 Drew Ave., Davis, CA 95616,
    RL   USA
    XX
    FH   Key             Location/Qualifiers
    FH
    FT   source          1..852
    FT                   /organism="Fusarium venenatum"
    FT                   /strain="ATCC20334"
    FT                   /db_xref="taxon:56646"
    . . .

Dump file (XML):

    <database>
      <name>IntAct.Experiment</name>
      <description>Experimental procedures that allowed to…</description>
      <release>1.0</release>
      <release_date>2007-Feb-16</release_date>
      <entry_count>5697</entry_count>
      <entries>
        <entry id="EBI-77680"> …
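To make the field extraction on this slide concrete, here is a minimal Python sketch that pulls the highlighted fields out of an EMBL flat-file entry and a Medline XML citation. It is not the EB-eye implementation: the slide names ANTLR/ANTXR grammars and the Lucene API, whereas this sketch swaps in hand-written parsing with the Python standard library. The function names, the output field names, and the choice of `DateRevised` as the "modification date" and of the journal `Title` as "name" are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from EMBL line codes to output field names.
EMBL_FIELDS = {
    "ID": "id",
    "AC": "accession",
    "DT": "dates",
    "DE": "description",
    "OS": "organism_species",
    "OC": "organism_classes",
}

EMBL_SAMPLE = """\
ID   AF030562; SV 1; linear; genomic DNA; STS; FUN; 852 BP.
XX
AC   AF030562;
XX
DT   04-DEC-1997 (Rel. 53, Created)
DT   03-MAR-2000 (Rel. 62, Last updated, Version 2)
XX
DE   Fusarium venenatum clone VEN-A RAPD band generated using Operon primer
DE   OPW-03, sequence tagged site.
XX
OS   Fusarium venenatum
"""

MEDLINE_SAMPLE = """\
<MedlineCitationSet><MedlineCitation Owner="NLM" Status="MEDLINE">
<PMID> 14216186 </PMID>
<DateCreated><Year> 1965 </Year><Month> 02 </Month><Day> 01 </Day></DateCreated>
<DateRevised><Year>2007</Year><Month>03</Month><Day>01</Day></DateRevised>
<Article PubModel="Print"><Journal>
<ISSN IssnType="Print"> 0009-8981 </ISSN>
<JournalIssue CitedMedium="Print"><Volume> 10 </Volume></JournalIssue>
<Title> Clinica chimica acta; international journal of clinical chemistry </Title>
</Journal></Article>
</MedlineCitation></MedlineCitationSet>
"""


def parse_embl_entry(text):
    """Collect the values of the selected line codes from one flat-file entry."""
    collected = {}
    for line in text.splitlines():
        code, _, value = line.partition(" ")
        name = EMBL_FIELDS.get(code)
        value = value.strip()
        if name and value:
            collected.setdefault(name, []).append(value)
    # Lines repeating the same code (e.g. two DE lines) are joined with a space.
    doc = {name: " ".join(values) for name, values in collected.items()}
    if "id" in doc:
        doc["id"] = doc["id"].split(";")[0].strip()
    if "accession" in doc:
        doc["accession"] = doc["accession"].rstrip(";").strip()
    return doc


def parse_medline_citation(xml_text):
    """Extract the highlighted fields from the first MedlineCitation element."""
    root = ET.fromstring(xml_text)
    citation = root if root.tag == "MedlineCitation" else root.find(".//MedlineCitation")

    def text_of(path):
        node = citation.find(path)
        return node.text.strip() if node is not None and node.text else None

    def date_of(tag):
        parts = [text_of(f"{tag}/{part}") for part in ("Year", "Month", "Day")]
        return "-".join(parts) if all(parts) else None

    return {
        "id": text_of("PMID"),
        "creation_date": date_of("DateCreated"),
        "modification_date": date_of("DateRevised"),  # assumed mapping
        "issn": text_of("Article/Journal/ISSN"),
        "volume": text_of("Article/Journal/JournalIssue/Volume"),
        "name": text_of("Article/Journal/Title"),  # assumed mapping
    }


if __name__ == "__main__":
    print(parse_embl_entry(EMBL_SAMPLE))
    print(parse_medline_citation(MEDLINE_SAMPLE))
```

Both functions return a flat dictionary per entry, which is the shape an indexer would typically turn into one document with one field per key.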
8. Divide and Conquer the Indexing

Data sets to index:
    UniProt (>4M entries): 2 files, ~9.4G
    Embl (>83M entries): >600 files, ~375G
    Medline (>16M entries): >500 files, ~57G
    Taxonomy (>0.37M entries): 1 file, ~81M
    GO (>0.23M entries): 1 file, ~27M
    Others (ArrayExpress, Ensembl, Intact, …)

[Diagram: XML files, XML dumps and the Db are spread over four boxes labelled "8 cpu"; the resulting indexes are Uniprot Index, Embl Index, Taxonomy Index, Medline Index, ArrayExpress Index, Ensembl Index and Intact Index.]
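The slide does not say how the input files are divided among the four 8-CPU boxes. As one possible illustration, the sketch below balances files across workers by size, always giving the next-largest file to the least-loaded worker. The function name, the file names, and the even per-file sizes are hypothetical; only the totals (about 375G for Embl, 57G for Medline, 9.4G for UniProt, 81M for Taxonomy, 27M for GO) come from the slide.

```python
import heapq


def partition_by_size(files, workers):
    """Greedy split: take files largest first, assign each to the lightest worker.

    `files` is an iterable of (name, size) pairs; returns a list of
    (total_size, [names]) tuples, one per worker.
    """
    # Heap entries are (load, worker_index, names); the unique index breaks ties.
    bins = [(0, i, []) for i in range(workers)]
    heapq.heapify(bins)
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        load, index, names = heapq.heappop(bins)
        names.append(name)
        heapq.heappush(bins, (load + size, index, names))
    return [(load, names) for load, _, names in sorted(bins, key=lambda b: b[1])]


# Sizes in MB. Per-file sizes are illustrative: the slide gives only totals.
FILES = (
    [(f"embl_{i}", 625) for i in range(600)]
    + [(f"medline_{i}", 114) for i in range(500)]
    + [("uniprot_a", 4700), ("uniprot_b", 4700), ("taxonomy", 81), ("go", 27)]
)

if __name__ == "__main__":
    for load, names in partition_by_size(FILES, workers=4):
        print(f"{len(names)} files, {load} MB")
```

With this greedy rule the gap between the busiest and the idlest worker never exceeds the size of the largest single file, so splitting a large source into many files (as with Embl and Medline) is what makes an even spread possible.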
9. Let's put some figures on it: less than 18 hours to index all the EBI data.
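As a rough sanity check on that figure, the sketch below converts it into an aggregate throughput. It assumes that the totals quoted earlier in the talk (more than 137M entries, more than 550 GB) are what gets indexed within the 18 hours, and that GB means 10^9 bytes; neither assumption is stated on the slide.

```python
# Assumed inputs: totals from the "data available" slide, 18-hour window.
INDEXING_SECONDS = 18 * 3600
TOTAL_BYTES = 550e9      # assumes decimal gigabytes
TOTAL_ENTRIES = 137e6


def min_rate(total, seconds):
    """Average rate needed to process `total` units within `seconds`."""
    return total / seconds


bytes_per_second = min_rate(TOTAL_BYTES, INDEXING_SECONDS)
entries_per_second = min_rate(TOTAL_ENTRIES, INDEXING_SECONDS)

if __name__ == "__main__":
    print(f"{bytes_per_second / 1e6:.1f} MB/s, {entries_per_second:.0f} entries/s")
```

Under those assumptions the whole system has to sustain roughly 8.5 MB/s and about 2,100 entries per second on average, which are lower bounds since the run takes less than 18 hours.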
10. Web side story. [Diagram: a load balancer in front of Tomcat 1, Tomcat 2, Tomcat 3 and Tomcat 4; indexes shown: UniProt Index, Embl Index, Taxonomy Index, Medline Index, ArrayExpress Index, Ensembl Index, Intact Index.]
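The slide shows a load balancer in front of four Tomcat instances but does not name the balancing policy. Purely as an illustration, the sketch below uses round-robin, one of the simplest policies; the class name and the backend labels are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backends in a fixed rotating order (assumed policy)."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._rotation = cycle(list(backends))

    def route(self, request):
        # The request content is ignored: round-robin only counts requests.
        return next(self._rotation)


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["Tomcat 1", "Tomcat 2", "Tomcat 3", "Tomcat 4"])
    for query in ["kinase", "p53", "fusarium", "brca1", "insulin"]:
        print(query, "->", balancer.route(query))
```

Round-robin needs no knowledge of backend load, which makes it a reasonable default when every instance can answer every query.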