Linking a Thesaurus To SharePoint for Content Management
1. Linking a Thesaurus To SharePoint
for Content Management
Scott Denning
Tao Liu
Access Innovations, Inc.
2. ASRT Taxonomy
• American Society of Radiologic Technologists
• Membership organization, more than 100,000
members
• Access Innovations, Inc.
• Taxonomy to encompass
– Knowledge domain
– Organizational structure
3. ASRT Taxonomy
• Intent was to have the taxonomy serve both
as a structure for indexing documents, and
eventually as a tool which would facilitate
keyword suggestion for documents at time of
generation.
• Thus, terms needed to be linked to content,
as well as descriptive of content
4. ASRT Taxonomy
• Not just for indexing, but in support of total
content management of documents from
many different sources
5. Requirements
• Use metadata from existing documents, as
well as providing/suggesting metadata for
created documents
• ASRT is a “Microsoft shop”
• Support storage as XML documents
• MS Office 2003, XML support features
• SharePoint™
6. SharePoint
• Supports taxonomies, but does not
provide taxonomies
• SharePoint’s strengths are collaboration,
version control, and searching.
• Provides some basic hierarchical structure:
– Categories
– Keywords
– “Best Bets”
7. The Challenges:
• Integrate ASRT taxonomy with SharePoint,
allowing users to exploit familiar features
while capitalizing on the hierarchical structure
of the taxonomy.
• Use M.A.I.™ (Machine Aided Indexer) to
suggest terms from the taxonomy as
keywords at the time of document
generation.
8. The Challenges – cont’d
• M.A.I. to run quietly in the background until
needed
• Provide/suggest indexing terms as document
is versioned or finalized
9. Requirements
• Encompass full trajectory of documents:
creation – search – repurposing – archiving
• Broad range of documents – administrative,
accounting, archival, educational, etc.
• Different document formats
• Flexible for content management
11. M.A.I. Considerations
• M.A.I. is a text-based tool; documents are in
many formats
• Should allow familiar SharePoint search
features to be used, while also suggesting
indexing terms/keywords
12. Access work
• Programs written to allow M.A.I. to handle
documents in different formats:
– Word (.doc)
– Excel (.xls)
– PowerPoint (.ppt)
– Portable Document Format (.pdf)
13. The Future?
• SharePoint/M.A.I. used to identify “expert
users” within ASRT, based upon congruency of
individuals’ keyword usage with taxonomy
terms
• M.A.I. embedded within/merged with other
programs, using versions of code written for
this project
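One way to read the “congruency” idea on this slide is as the share of a person's keyword usage that matches taxonomy terms. The Python sketch below is only an illustration under that assumption; the function names, the scoring formula, the user names, and the sample terms are all hypothetical and are not part of SharePoint or M.A.I.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def congruency(keywords: Iterable[str], taxonomy_terms: Set[str]) -> float:
    """Fraction of keyword uses that are taxonomy terms (terms given in lowercase)."""
    counts = Counter(k.strip().lower() for k in keywords)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    matched = sum(n for k, n in counts.items() if k in taxonomy_terms)
    return matched / total


def rank_users(usage: Dict[str, List[str]],
               taxonomy_terms: Set[str]) -> List[Tuple[str, float]]:
    """Rank users from most to least congruent with the taxonomy."""
    terms = {t.lower() for t in taxonomy_terms}
    scores = [(user, congruency(kws, terms)) for user, kws in usage.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data, for illustration only.
sample_terms = {"mammography", "radiation therapy", "continuing education"}
sample_usage = {
    "user_a": ["Mammography", "mammography", "budget"],
    "user_b": ["continuing education", "radiation therapy"],
}
ranking = rank_users(sample_usage, sample_terms)
```

With this sample data, user_b ranks first because every keyword that user applied is a taxonomy term, while one of user_a's three keywords is not.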
Editor's notes
Indexed items in an electronic collection allow both higher recall and greater precision in search returns. How can this feature be implemented in a SharePoint Collaboration environment?
Our example for this discussion is the electronic collection of in-house documents – meeting minutes, committee proposals, reports to colleagues and to the membership, best practice documents, etc. – of the professional association ASRT. The taxonomy includes a collection of terms – single words or short phrases that represent the concepts included in the documents. Additionally, the taxonomy is organized in a hierarchy that mirrors the organization’s structure.
Each term in a taxonomy vocabulary (also known as a thesaurus) should represent a single meaning whenever possible. (Some words, such as ‘paper’, have different meanings in different contexts, so ‘white paper’, ‘paper stock’, and ‘newspaper’ work better as concept terms because their meanings are less ambiguous than ‘paper’ by itself.) A term’s meaning should be what a reader would offer as the subject (or one of the subjects) of a document when describing its content.
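As a minimal sketch of the structure described in these notes, a taxonomy can be modelled as a set of unambiguous terms, each with an optional broader term and a list of synonyms. The Python below is illustrative only: the class names, field names, and the top-level terms “Publications” and “Supplies” are assumptions, not the ASRT taxonomy or the Data Harmony data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence


@dataclass
class Term:
    label: str                      # preferred, unambiguous wording
    broader: Optional[str] = None   # parent term in the hierarchy, if any
    synonyms: List[str] = field(default_factory=list)


class Taxonomy:
    def __init__(self) -> None:
        self.terms: Dict[str, Term] = {}

    def add(self, label: str, broader: Optional[str] = None,
            synonyms: Sequence[str] = ()) -> None:
        """Add a term; its broader term must already exist."""
        if broader is not None and broader not in self.terms:
            raise KeyError(f"unknown broader term: {broader}")
        self.terms[label] = Term(label, broader, list(synonyms))

    def path(self, label: str) -> List[str]:
        """Return the chain of terms from the top of the hierarchy down to label."""
        chain: List[str] = []
        current: Optional[str] = label
        while current is not None:
            chain.append(current)
            current = self.terms[current].broader
        return list(reversed(chain))


# Hypothetical hierarchy built around the 'paper' examples.
taxonomy = Taxonomy()
taxonomy.add("Publications")
taxonomy.add("Supplies")
taxonomy.add("white paper", broader="Publications")
taxonomy.add("newspaper", broader="Publications")
taxonomy.add("paper stock", broader="Supplies")
```

Note that the ambiguous word ‘paper’ is deliberately not a term on its own; each of its senses gets a separate entry with its own place in the hierarchy.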
The ASRT taxonomy, organized by operational units, provided a structure for file organization and storage and for website navigation.
Underlying requirements for this implementation recognized that documents would be in Microsoft application formats and (for those to be published in journals) in XML format. Documents already included some metadata, such as date created, date modified, and author/creator. Existing metadata needed to be preserved, and additional metadata added to it. The additional metadata would include category and subject (indexing) terms to enhance document “usability” and “findability”.
SharePoint Server 2005 had already been implemented at ASRT. It includes a taxonomy feature which consists of a list of keywords that can include synonyms and weightings. Unfortunately, its implementation is cumbersome and doesn’t achieve the expected results. A solution that enhances SharePoint’s strengths was needed.
The taxonomy design was carefully planned to best suit organizational needs. The configuration of SharePoint and organization of its storage needed to reflect the considerations addressed in the taxonomy design. Additionally, the SharePoint search engine “keyword search” feature needed to be implemented to produce the enhanced search results.
The Data Harmony Machine Aided Indexer (M.A.I.) can suggest keywords. It just needed to be integrated with the SharePoint workflow to quietly “do its stuff”.
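To make the idea of keyword suggestion concrete, the sketch below matches taxonomy phrases against a document's text and ranks the preferred terms by how often they occur. This is a deliberately simple stand-in and not how M.A.I. works internally; the function names and the sample vocabulary are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def suggest_keywords(text: str, vocabulary: Dict[str, str],
                     limit: int = 5) -> List[str]:
    """Suggest preferred terms whose phrases occur in the text.

    vocabulary maps a phrase (a preferred term or a synonym) to its
    preferred term. Longer phrases are tried first at each position.
    """
    tokens = tokenize(text)
    phrases: Dict[Tuple[str, ...], str] = {
        tuple(tokenize(phrase)): preferred
        for phrase, preferred in vocabulary.items()
    }
    longest = max((len(p) for p in phrases), default=0)
    counts: Counter = Counter()
    i = 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            preferred = phrases.get(tuple(tokens[i:i + n]))
            if preferred is not None:
                counts[preferred] += 1
                i += n          # skip past the matched phrase
                break
        else:
            i += 1              # no phrase starts at this token
    return [term for term, _ in counts.most_common(limit)]


# Hypothetical vocabulary: 'position paper' is treated as a synonym.
sample_vocabulary = {
    "white paper": "white paper",
    "position paper": "white paper",
    "newspaper": "newspaper",
    "meeting minutes": "meeting minutes",
}
sample_text = ("The committee reviewed the meeting minutes and a draft "
               "white paper. A second position paper is planned.")
suggested = suggest_keywords(sample_text, sample_vocabulary)
```

Here the synonym ‘position paper’ is counted toward the preferred term ‘white paper’, so that term ranks ahead of ‘meeting minutes’.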
The integration had to take into consideration document use, category, format and destination properties.
The services of a Microsoft Solutions Partner, Interlink Group, were employed to produce the required SharePoint code.
Part of the project involved the conversion of various document formats into plain text. Additionally, a SharePoint web part needed to be designed to make search-by-keyword an easily requested option.
This conversion task can now be done by the Sun Open Office Suite server. At the time of this project, an application needed to be developed specifically for the Windows platform.
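The conversion step described above amounts to choosing a plain-text converter based on the document's format. A minimal sketch of that dispatch follows, assuming a registry keyed by file extension; the function names are hypothetical, and only a trivial .txt converter is implemented here because the project's actual converters are not described in the source.

```python
from pathlib import PurePath
from typing import Callable, Dict

# A converter takes the raw bytes of a file and returns plain text.
Converter = Callable[[bytes], str]
CONVERTERS: Dict[str, Converter] = {}


def register(extension: str, converter: Converter) -> None:
    """Associate a file extension such as '.pdf' with a converter."""
    CONVERTERS[extension.lower()] = converter


def to_plain_text(filename: str, data: bytes) -> str:
    """Convert a document to plain text using the converter for its extension."""
    extension = PurePath(filename).suffix.lower()
    try:
        converter = CONVERTERS[extension]
    except KeyError:
        raise ValueError(f"no converter registered for {extension!r}") from None
    return converter(data)


def convert_txt(data: bytes) -> str:
    """Plain text needs only decoding; undecodable bytes are replaced."""
    return data.decode("utf-8", errors="replace")


register(".txt", convert_txt)
# Converters for .doc, .xls, .ppt and .pdf would be registered the same way,
# each wrapping whatever extraction tool is available for that format.

sample_output = to_plain_text("minutes.TXT", b"Board meeting, 12 May")
```

Keeping the converters behind a single function means the indexing step only ever sees plain text, whatever the original format was.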
Ultimately, M.A.I.’s indexing work was done at the time a document was saved (or uploaded) in SharePoint. The option for the user to review the suggested keywords before they were ‘attached’ to the document as a custom property was implemented selectively. For most users, the keyword attachment was accomplished “behind the scenes”. For editors maintaining the taxonomy, the process is visible and interactive. In that way, the taxonomy elements are continually updated and improved as the language of the field evolves.