Lucene revolution with Data Harmony

1. Leveraging the Power of Lucene and XML for Instant Semantically Enriched Data Distribution
Marjorie Hlava, President and Chairman
Lamine Idjeraoui, Java / Lucene programmer
Access Innovations, Inc.
mhlava@accessinn.com
October 8, 2010

2. Outline for today
- The Case: NICEM / Media Sleuth
- The Challenges
- Taxonomy - Semantic Enrichment
- Lucene Deployment
- The Search Interface
- Lucene Effect
- Wrap up
©2010 Access Innovations, Inc. All Rights Reserved

3. Access Innovations, Inc
- NICEM and Media Sleuth = the case
- Software and service company
- Founded 1978
- Create and implement taxonomies
- Create thesauri
- Provide semantic enrichment tools
- Provide metadata extraction tools
- Standards early adopters and developers: thesauri (ANSI/NISO Z39.19), taxonomies, metadata, Dublin Core, etc.
- Developers of Data Harmony™: XML, Java, TCP/IP, Unicode

4. National Information Center for Educational Media (NICEM)
- NICEM data base
- NICEM archive files
- 670,000 media from 25,000 sources
- MediaSleuth: e-commerce platform to purchase media
- Once a staff of 31 editors, now done by a staff of two
- Created and stored in an XML intranet system (XIS)
- Save to XIS, save to SQL, save to Lucene
- One click - ON SAVE

5. Data Flow / Collection
- Producer and distributor sources
- Catalogs
- Web sites
- Uploads
- Crawler harvesting and auto extraction
- Auto indexing
- Load to XIS
- NICEM thesaurus terms applied
- Editorial review

6.
NICEM Data Base Creation
- The database is created using the XIS (XML Intranet System)
- There are 57 fields of data possible
- Many have a pick list or authority file
- Some have ranges of allowed values
- The NICEM Taxonomy is used to index all records
- MAI* is used to automatically suggest valid taxonomy terms
- Metadata extractor is used to pull the data from sources
*MAI is Data Harmony's automated indexer

7. MediaSleuth
- E-commerce division of NICEM
- Utilizes database records from the NICEM electronic database and adds e-commerce
- Calls on the NICEM taxonomy for an auto-completion feature at the time of search
- The search presentation layer is Search Harmony
- Draws on the full thesaurus (taxonomy)
- Uses the same terms as used in semantic enrichment of the sources

8. NICEM data base creation
[Diagram] Labels: Raw full-text data feeds; Printed source materials; Data crawls on sources; Add metadata; Metadata Extractor; MAI Concept Extractor; MAI Rule Base; Taxonomy terms; Taxonomy; Thesaurus Master; XIS creation; XIS repository; On Save; SQL for e-commerce; Load to NICEM Lucene; Load to MediaSleuth Lucene; Search Harmony display

11. The Data Challenge
- The complexity of media
- Educational media
- Changes hands regularly (IP bought and sold)
- Changes format often, e.g. film to CD to streaming media
- One year 25% of the data changed format
- Linking related media
- Users with many search styles
- Need immediate access to changes
- No monthly cycle for loading allowed

12. The Search Challenge
Considerations
- Too long until available on website
- Use taxonomy for semantic enrichment
- Use the taxonomy in search
- XML records for portability
- Staff productivity
Key hurdles
- One content set, two websites, three data files, from one data base
- E-commerce = YES
- Flexible search to match learning styles
- Support the ordering and delivery of media

13.
The NICEM Thesaurus (Taxonomy)
- Hierarchical outline of content by subject categories
- Basis for browsing
- Framework for content organization
- Increased recall
- Better precision
- High accuracy
- Terms: total terms 5068 preferred terms 4133 nonpreferred terms (use or see also)

16. NICEM - Lucene Deployment
[Diagram] Labels: Query; Search; Search Harmony Presentation Layer; Lucene Index; Auto-completion; NavTree; Narrower Terms; Related Terms; Building Lucene index; Cleanup, etc.; NICEM data base in XIS; Repository; XIS
- Query fetches the hit list from SH (Search Harmony) and snippets from the Repository.
- Data is forked so that Data Harmony components can serve snippets and docs, and SH can build indexes.

17. Technical Detail
Before a record is added to the Lucene index, its data is submitted to the DH MAI autoindexer to extract taxonomy terms.
Code snippet:
    thesTerms = getSuggestedTerms(data); // pass the data through the DH indexer
    doc.add(thesTerms);                  // add the suggested terms to the Lucene doc
    doc.add(data);                       // add the other data to the Lucene doc
    writer.addDocument(doc);             // add the doc to the Lucene index

18. Taxonomy Search* on Lucene
- Auto-completion using the taxonomy
- Guide the user by applying various semantic relationships
- Navigate the full taxonomy "tree"

19. Direct link to e-commerce to improve sales
- Link search and taxonomy directly to the supply of documents, or redirect to a shopping cart

20. Lucene / Solr EFFECT
- Lucene search for the NICEM and MediaSleuth web sites
- More items viewed, more items found, more orders
- Easy to implement taxonomy search
- Users find information faster
- Gave us the flexibility to do ON SAVE
- Multiple systems
- Semantics support contextual GoogleAds
- Web stats are up and increasing

21.
Overview
- NICEM / Media Sleuth
- Data base creation
- Use XML and taxonomy
- Automate the content semantic enrichment
- High productivity achieved
- Lucene for search
- Semantically enhanced search
- Cost effective, high accuracy
- Thank you Lucid!

22. Thank You!
Marjorie M.K. Hlava, President and Chairman
mhlava@accessinn.com
www.taxodiary.com - the taxonomy news blog
mmkhlava = twitter
mhlava = facebook, linkedin, eacademy, plaxo
Lamine Idjeraoui
lamine_Idjeraoui@accessinn.com
Access Innovations / Data Harmony
www.dataharmony.com
www.accessinn.com
505-998-0800

23. Sources
- www.nicem.com
- www.mediasleuth.com
Next: site search on other sites
- www.dataharmony.com
- www.accessinn.com
"Indispensable for anyone trying to identify instructional media for teaching." - CHOICE Magazine

24. Data Harmony Architecture
[Diagram] Labels: M.A.I. Rule Bases; M.A.I. Concept Extractor; Auto Summarization; Entity Extractor; Novelty Detection; Search Software; Search Indexes; Thesaurus Master; WEB Server I; Data Harmony Administrative Module; Rules for Concept Extractor; SUBJECT TERMS; ABSTRACT; Dublin Core METADATA; Library OPAC; Database system; Bibliographic citation with abstract; Search Server; Web Portals; DH API; Web Content; Files, Documents; DH CONCEPT EXTRACTION SYSTEM; Databases; Email, Groupware, etc.
Further labels: Taxonomies / ontology; Auto-completion; Broader Term; Narrower Term; Related Term; Navigation Tree; Categorization; Inline tagging; Query expansion using rule base; Fast indexing; Massive data sets; Incremental indexing; Fast query speeds; Search within results

25. XIS/Data Harmony
- Written in Java (Java plug-in installs automatically)
- Stores data in XML format
- Web Services or client-server
- Functions on any platform: Windows, NT, Mac, Unix, Linux, Solaris
- SaaS and ASP available
- Password-controlled access
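
The indexing step on slide 17 (run each record through the MAI autoindexer, then add both the suggested terms and the record to the index) can be illustrated with a small, library-free Java sketch. Everything in it is an assumption made for illustration: plain Java collections stand in for the Lucene index, a two-entry keyword table stands in for the MAI rule base, and the class name, the method names other than getSuggestedTerms, and the sample records are hypothetical. It is not the Data Harmony or Lucene API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: plain collections stand in for Lucene and for the MAI indexer. */
class EnrichThenIndexSketch {

    // Hypothetical rule base: trigger word -> taxonomy term (not the real MAI rules).
    private final Map<String, String> rules = new HashMap<>();

    // Stand-in for stored documents: record id -> original text.
    private final Map<Integer, String> records = new HashMap<>();

    // Stand-in for the search index: taxonomy term -> ids of records tagged with it.
    private final Map<String, Set<Integer>> termIndex = new HashMap<>();

    EnrichThenIndexSketch() {
        rules.put("volcano", "Geology");
        rules.put("photosynthesis", "Botany");
    }

    // Plays the role of getSuggestedTerms(data) on the slide:
    // returns every taxonomy term whose trigger word occurs in the text.
    Set<String> getSuggestedTerms(String data) {
        String lower = data.toLowerCase(Locale.ROOT);
        Set<String> terms = new TreeSet<>();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            if (lower.contains(rule.getKey())) {
                terms.add(rule.getValue());
            }
        }
        return terms;
    }

    // Same order as the slide: suggest terms first, then store terms and data together.
    void addRecord(int id, String data) {
        Set<String> thesTerms = getSuggestedTerms(data);
        for (String term : thesTerms) {
            termIndex.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
        }
        records.put(id, data);
    }

    // Look up record ids by taxonomy term; unknown terms give an empty set.
    Set<Integer> searchByTerm(String term) {
        return termIndex.getOrDefault(term, Collections.emptySet());
    }

    // Return the stored text of a record, or null if the id is unknown.
    String textOf(int id) {
        return records.get(id);
    }

    public static void main(String[] args) {
        EnrichThenIndexSketch index = new EnrichThenIndexSketch();
        index.addRecord(1, "A film about how a volcano forms");
        index.addRecord(2, "Photosynthesis explained for middle school");
        System.out.println(index.searchByTerm("Geology")); // prints [1]
    }
}
```

Because the terms are attached at the moment a record is added, a record becomes searchable by taxonomy term as soon as it is saved, which is the property the ON SAVE workflow in the slides relies on.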