4. Metadata basics
What is it? Where is it stored?
Metadata is the set of properties that characterize a document.
5. Poor metadata impairs the search experience
Degraded findability leads to the erosion of users’ trust in search
Inconsistent, incorrect or missing metadata is commonplace within most organizations today
This impairs findability in the context of enterprise search
• Hard to scan or navigate results
• Documents returned may be incomplete or not current
• No confidence in authority and correctness of information
• Difficult to locate relevant experts
• Even with refinement tools, users do not rely on them
    • Multiple variations or spellings
    • Hit counts do not add up
Callouts
• Few options to navigate or refine a large result list other than trying to reformulate the query
• Unchanged template metadata make results look like duplicates
• Meaningless metadata confuses users as they scan the search results
• Missing metadata raises questions about result set completeness
User reactions
• “I’m not confident I will find what I need here…”
• “This is a waste of time!”
6. ROI - Scenarios
1. Time Wasted Searching
2. Cost of Reworking Information
3. Opportunity Costs to the Enterprise
6 | SharePoint Server 2010 for Internet Sites Microsoft confidential.
7. Scenario 1: Time wasted
Salary of €3.000/month + social costs = €50.000/year per employee
10 minutes/day wasted searching × 220 working days/year
≈ €1.000 per employee per year
1000 employees = €1.000.000/year of “released time”
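The arithmetic behind this scenario can be reproduced with a short Python sketch. The annual cost, the 10 minutes per day, the 220 days and the 1000 employees come from the slide; the 8-hour working day is an assumption added here to convert minutes into money, and the function name is illustrative only.

```python
# Figures taken from the slide
ANNUAL_COST_EUR = 50_000       # salary plus social costs per employee
WORKING_DAYS = 220             # working days per year
MINUTES_LOST_PER_DAY = 10      # time wasted searching per day
EMPLOYEES = 1_000

# Assumption (not on the slide): length of a working day in hours
HOURS_PER_DAY = 8


def released_time_value(annual_cost, working_days, minutes_per_day, hours_per_day):
    """Return the yearly cost of wasted search time for one employee."""
    hourly_cost = annual_cost / (working_days * hours_per_day)
    hours_lost = working_days * minutes_per_day / 60
    return hourly_cost * hours_lost


per_employee = released_time_value(
    ANNUAL_COST_EUR, WORKING_DAYS, MINUTES_LOST_PER_DAY, HOURS_PER_DAY
)
total = per_employee * EMPLOYEES

print(f"Per employee: about EUR {per_employee:,.0f} per year")
print(f"For {EMPLOYEES} employees: about EUR {total:,.0f} per year")
```

Under these assumptions the result is roughly €1.040 per employee, which the slide rounds to €1.000 per employee and €1.000.000 for the organization.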
8. Creating quality metadata is a real challenge
Few organizations have good quality metadata on internal content
Challenge
• Ineffective information governance across the enterprise
• Multiple content silos and search interfaces
• Manually entered metadata is inconsistent, incorrect or missing
• No automated tools for content classification
• Impossible to keep up with ever growing content volumes
Assist users in tagging content with automated metadata suggestions or enrichment tools
Solution
• FAST Search for SharePoint (FS4SP) delivers business value out-of-the-box
• Sophisticated content processing optimizes findability across multiple silos of unstructured and structured content
• In addition, property extraction overcomes poor metadata by generating it and normalizing it on-the-fly
10. Content Processing Pipeline – what is it?
Enhance your content for optimal search experience and findability
The pipeline is a sequentially arranged set of discrete processing
stages that break down and enrich content for indexing
Convert documents to plain text (support for 400+ file formats)
Detect document languages and encoding (support for 80+ languages)
Apply linguistic normalization to optimize content for search
Identify and leverage existing metadata where applicable
Parse content to extract or generate additional metadata
Map content and associated metadata (crawled properties) to the index
schema (managed properties) for searching
Custom stages can be created and added to the pipeline
Pipeline stages
• Format Conversion: Extracts plain text and metadata from multiple content formats (e.g. Microsoft Office, PDF, HTML, etc.)
• Language and Encoding Detection: Identifies the languages and encoding used in the content; enables language-specific processing downstream
• Tokenization: Breaks text into tokens, using language-specific rules for punctuation, accents and diacritics
• Lemmatization: Enables users’ queries to match inflected forms of words (singular/plural, masculine/feminine, etc.)
• Date and Time Normalization: Converts dates and times to a standard, canonical representation; for example, the date 14-Mar-10 is equivalent to March 14, 2010
• Property Extraction: Recognizes predefined entities mentioned in the content (People, Companies and Locations); can be extended to other categories (currency, telephones, etc.) using dictionaries
• Vectorization: Creates vectors (phrase/weight pairs) reflecting important terms and frequency of occurrence, to enable “find similar” functionality
• Properties Mapper: Maps the relevant pieces of content and metadata discovered in the pipeline to the index schema for search
• Custom Stage: Enables you to extend the pipeline with custom stages appropriate to your own business needs (home-grown solutions or 3rd party software)
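The idea of a pipeline as a sequence of discrete stages can be illustrated with a minimal Python sketch. It is not FS4SP code: the stage functions, the document layout (a plain dictionary) and the property names are hypothetical, and only three simplified stages are shown (tokenization, date normalization and property mapping).

```python
import re
from datetime import datetime

# Matches dates written like 14-Mar-10 (day, English month abbreviation, 2-digit year)
DATE_PATTERN = re.compile(r"\b\d{1,2}-[A-Za-z]{3}-\d{2}\b")


def tokenize(doc):
    """Break the body text into lowercase word tokens."""
    doc["tokens"] = re.findall(r"\w+", doc["body"].lower())
    return doc


def normalize_dates(doc):
    """Convert dd-Mon-yy dates to ISO 8601 (assumes the default English locale)."""
    dates = []
    for match in DATE_PATTERN.findall(doc["body"]):
        try:
            dates.append(datetime.strptime(match, "%d-%b-%y").date().isoformat())
        except ValueError:
            continue  # looks like a date but is not a valid one
    doc["dates"] = dates
    return doc


def map_properties(doc):
    """Map what earlier stages discovered to (hypothetical) managed properties."""
    doc["managed"] = {
        "Date": doc.get("dates", []),
        "TokenCount": len(doc.get("tokens", [])),
    }
    return doc


# Stages run strictly in order; each one enriches the same document
PIPELINE = [tokenize, normalize_dates, map_properties]


def run_pipeline(doc, stages=PIPELINE):
    doc = dict(doc)  # work on a copy so the caller's document is untouched
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline({"body": "Sales forecast reviewed on 14-Mar-10 in London."})
print(result["managed"])
```

A custom stage in this sketch is simply another function appended to `PIPELINE`, which mirrors the idea that custom stages can be added to the pipeline.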
11. Property Extraction
Create metadata on-the-fly, adding structure to unstructured content
In a nutshell, property extraction is the ability to
• Process unstructured content (e.g. a document’s body)
• Recognize entities mentioned in the text (e.g. people, companies, locations, concepts, etc.)
• Optionally, normalize variations to a single, canonical form
• Expose these extracted entities as crawled properties in pipeline
• Map them to managed properties for filtering and searching
Crawled Properties
• Companies: Microsoft, Contoso, Woodgrove, …
• Locations: London, San Francisco, Moscow, …
• People: Bill Gates, Barack Obama, José Caires, ...
Index Schema: Managed Properties
Type | Doc ID | Title | Author | Date | Size | Keywords | Companies | Locations | People | ... | Body Text
 | xxx | Sales For… | John Doe | 2010-04-15 | 386 KB | sales; pipe… | Microsoft; … | London; … | Bill Gates; … | … | The mark…
 | yyy | … | … | … | … | … | … | … | … | … | …
 | zzz | … | … | … | … | … | … | … | … | … | …
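Dictionary-based property extraction with normalization can be sketched as follows. This is an illustration of the concept, not the FS4SP implementation: the dictionary contents, the variations (such as "MSFT") and the function name are hypothetical, and matching is a simple case-insensitive whole-phrase search.

```python
import re

# Hypothetical dictionary: each known variation maps to one canonical form
COMPANY_DICTIONARY = {
    "microsoft": "Microsoft",
    "microsoft corp": "Microsoft",
    "msft": "Microsoft",
    "contoso": "Contoso",
    "contoso ltd": "Contoso",
}


def extract_entities(text, dictionary):
    """Return the sorted canonical entities whose variations occur in the text."""
    found = set()
    for variation, canonical in dictionary.items():
        pattern = r"\b" + re.escape(variation) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(canonical)
    return sorted(found)


body = "MSFT and Contoso Ltd signed a partnership in London."

# Expose the extracted entities as a crawled property of the document
crawled_properties = {"Companies": extract_entities(body, COMPANY_DICTIONARY)}
print(crawled_properties)
```

Because every variation resolves to one canonical value, a refiner built on the resulting property shows a single "Microsoft" entry rather than one entry per spelling.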
12. Good metadata greatly improves findability
Property extraction enables consistent metadata across all content
Metadata quality is critical to the search experience
FS4SP leverages metadata, i.e. managed properties, to present deep refiners
• Offer at-a-glance overview
• Organize free-text search results into multiple facets
• Make search conversational
• Guide users toward possible refinement choices
• Prevent users drilling down into a “0 results” dead end
Additional uses for managed properties in FS4SP
• Relevancy tuning & ranking
• Multi-level sorting
• Advanced (or fielded) search
• And many more…
Callouts
• “This is really great! Now I can navigate through this large information universe without feeling lost…”
• Metadata is also used for relevancy tuning, multi-level sorting and advanced search
• Precise hit counts in deep refiners are computed across the whole result set.
Example refiners: File Formats, Companies, Products, Concepts
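The point about hit counts computed across the whole result set can be made concrete with a small Python sketch. It is a conceptual illustration only: the result documents, property names and helper functions are hypothetical and do not reflect how FS4SP computes refiners internally.

```python
from collections import Counter


def compute_refiners(results, properties):
    """Count, for each property, how many documents in the FULL result set carry each value."""
    refiners = {}
    for prop in properties:
        counts = Counter()
        for doc in results:
            # Count each value once per document, even if it is listed twice
            for value in set(doc.get(prop, [])):
                counts[value] += 1
        refiners[prop] = dict(counts)
    return refiners


def refine(results, prop, value):
    """Narrow the result set to documents that carry the chosen refiner value."""
    return [doc for doc in results if value in doc.get(prop, [])]


# Hypothetical result set for one query
results = [
    {"title": "Sales forecast", "Companies": ["Microsoft", "Contoso"], "Locations": ["London"]},
    {"title": "Partner review", "Companies": ["Microsoft"], "Locations": ["Moscow"]},
    {"title": "Travel policy", "Companies": [], "Locations": ["London"]},
]

refiners = compute_refiners(results, ["Companies", "Locations"])
print(refiners)
print(len(refine(results, "Companies", "Microsoft")))
```

Since every count is derived from all matching documents, the number shown next to a refiner equals the number of results the user gets after clicking it, and a value with no documents is never offered, which is what avoids the "0 results" dead end.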
13. The Microsoft IT Intranet Environment
Region | Location | Content | Share | Sites | Sub-sites
Americas | Seattle | 19.4 TB | 65% | 127,986 | 345,935
Europe, Middle East, Africa | Dublin | 6.4 TB | 22% | 49,731 | 117,324
Asia Pacific | Singapore | 4.1 TB | 13% | 45,878 | 82,128
Total | | 29.89 TB (31,346,042 MB) | | 223,595 | 545,387
Grows by 1.5 TB per quarter
As of September 2010
16. Property extraction and refiners in FS4SP
What’s available out-of-the-box?
FS4SP automatically detects 80+
languages in content
Property extraction dictionaries are
included for 11 languages* and 3
types of entities
Locations
Companies
Persons
The metadata is exposed to users as
refiners, drives relevancy and other
features to improve findability
This delivers real business value to
organizations struggling with issues
such as
Poor document metadata
Large content volumes
Lack of result refinement options
Low user adoption of search
* Arabic, Dutch, English, French, German, Italian, Japanese, Norwegian, Portuguese, Russian, Spanish
18. Extending property extraction in FS4SP (1/2)
Make search speak the language of your business using dictionaries
Property extraction in FS4SP is customizable using a dictionary, i.e. a list of keywords and phrases
Matching variations can be normalized to a single entry
Several dictionaries may co-exist to address needs of the business
• Projects
• Products
• Customers
• Competitors
• Employees
• Business-specific concepts
The necessary data may be readily available within the organization or from external sources
• SharePoint lists & Term Store
• LOB applications, Databases & XML
Create custom search refiners to fit your own business needs
19. Extending property extraction in FS4SP (2/2)
Use existing text mining or classification tools to go even further
Another approach is to invoke external tools during content processing in FS4SP
This leverages the standard pipeline extensibility mechanism
Such tools typically address problems like
• Text mining for entity, fact or relationship extraction
• Taxonomy classification
Moreover, these tools may be already deployed for other purposes in the enterprise
• Home-grown solutions
• 3rd party, specialized vendors
• Industry sectors or verticals
• Scientific or technology domains
Diagram: the original document from the repository enters the content pipeline; the pipeline calls an external text mining/classification tool (local software or Web service), which analyzes the text content and returns metadata tags; the enriched document is then passed on for indexing
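The flow of calling an external tool from a pipeline stage (analyze text content, return metadata tags, enrich the document) can be sketched in Python. This is a generic illustration and does not use the actual FS4SP pipeline extensibility interface: `keyword_classifier` is a hypothetical stand-in for an external tool or Web service, and the taxonomy it uses is invented for the example.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]


def keyword_classifier(text: str) -> List[str]:
    """Stand-in for an external classification tool (hypothetical taxonomy)."""
    taxonomy = {"contract": "Legal", "invoice": "Finance", "forecast": "Sales"}
    lowered = text.lower()
    return sorted({label for keyword, label in taxonomy.items() if keyword in lowered})


def enrich_stage(doc: Document, tool: Callable[[str], List[str]]) -> Document:
    """Send the document text to the tool and attach the returned tags."""
    enriched = dict(doc)  # leave the original document untouched
    try:
        enriched["tags"] = tool(str(doc.get("body", "")))
    except Exception:
        # Design choice in this sketch: a failing tool must not block indexing
        enriched["tags"] = []
    return enriched


original = {"id": "doc-1", "body": "Q3 sales forecast and invoice summary"}
enriched = enrich_stage(original, keyword_classifier)
print(enriched["tags"])
```

Passing the tool in as a callable keeps the stage independent of whether the classifier runs as local software or behind a Web service; only the callable would change.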
21. Best practice #1
Deepen your understanding of your audiences and your content
Marketing Sales Consulting Procurement Production Research IT Support HR / Legal
Enterprise
content
Before you start deploying enterprise search:
understand your content, your users and what
they need to get their jobs done effectively.
22. Best practice #2
Use existing language resources inside and outside your enterprise
Internal assets
•Thesauri, controlled vocabularies
•Taxonomies, ontologies
•Master databases
•Enterprise systems
•Line-of-business applications
•Subject matter experts
•Examples*
    •SharePoint (Lists, Term Store)
    •Employees (AD, HR)
    •Customers (CRM)
    •Suppliers (ERP)
    •Products (PLM)
    •Processes (BPM)
    •Projects (EPM)
Internet resources, content providers, specialized vendors
•Government agencies
•Industry bodies
•Research institutions
•Academia
•Virtual communities
•Examples
    •Wikipedia.org
    •DBpedia.org
    •WordNet, from Princeton University
    •Medical Subject Headings (MeSH)
* AD – Active Directory; CRM – Customer Relationship Mgmt.; ERP – Enterprise Resource Planning; PLM – Product Lifecycle Mgmt.; BPM – Business Process Modeling; EPM – Enterprise Project Mgmt.
23. Best practice #3
Keep the index synchronized with content sources and dictionaries
The language of the business will change over time
• External environment
• Enterprise content
• Users’ needs
Ensure that property extraction dictionaries and search index are systematically updated to respond to these changes
Where possible, automate dictionary upkeep as part of standard business workflows
• Taxonomies and thesauri
• Enterprise project management
• Product lifecycle management
Schedule regular analysis and review checkpoints to handle exceptional cases
Diagram labels: Dictionary Data Sources; Property Extraction Dictionaries; Enterprise Content Sources; Search Index; Search synchronized with changes over time
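Automated dictionary upkeep can be illustrated with a minimal Python sketch that compares a dictionary with its data source and reports what changed. The function names, the project names and the "needs reprocessing" flag are hypothetical; how an updated dictionary is deployed and how content is reprocessed in FS4SP is outside the scope of this sketch.

```python
def diff_dictionary(current_terms, source_terms):
    """Return the terms to add and remove so the dictionary matches its source."""
    current, source = set(current_terms), set(source_terms)
    return {
        "add": sorted(source - current),
        "remove": sorted(current - source),
    }


def synchronize(current_terms, source_terms):
    """Apply the changes and flag whether indexed content may need reprocessing."""
    changes = diff_dictionary(current_terms, source_terms)
    updated = sorted((set(current_terms) | set(changes["add"])) - set(changes["remove"]))
    needs_reprocessing = bool(changes["add"] or changes["remove"])
    return updated, changes, needs_reprocessing


# Hypothetical example: project names exported from a project management system
dictionary_terms = ["Project Alpha", "Project Beta"]
source_terms = ["Project Beta", "Project Gamma"]

updated, changes, needs_reprocessing = synchronize(dictionary_terms, source_terms)
print(changes)
print(updated, needs_reprocessing)
```

Running such a comparison on a schedule, or as a step in the business workflow that maintains the source list, leaves only the exceptional cases for the regular manual review.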
24. Best practice #4
Distinguish search management from systems management
As the language of your business and users’ needs evolve, so should your search solution
If not, the search experience and findability inevitably degrade over time – users’ trust will plunge too
Search management is not an IT responsibility; it is a business responsibility
Job profile
• Skillset of a SharePoint administrator (not a programmer or systems engineer)
• Business perspective and focus
• Good ability with languages
• Attention to detail
Sample tasks
• Monitor search reports (daily/weekly)
• Run user polls and/or focus groups (quarterly)
• Process users’ feedback/questions
• Update dictionaries and manage keywords (as required)
• Support search-related projects
Staffing – depends on scale
• One person part-time, or
• A geographically distributed team
Chart labels: Original implementation of the search solution; Actual search experience, if left unattended...
26. Case study #1
General Mills (Research & Development)
Business Problem
• Researchers forced to search each internal and
external content source separately
• Low relevancy in existing search applications
• High effort in information discovery tasks
• Growing difficulty in establishing connections with
experts as company grew worldwide
Approach & Solution
• FAST Search for SharePoint indexes all internal
sources and federates external industry services
• Property extraction dictionaries extended to
recognize product names cited in documents
• Deep refiners are used on extracted properties to
drill down by products, companies and people
Benefits & Value
• Improved employee productivity with more relevant search results in a unified interface
• Greater information sharing and reuse across product areas & geographies
• Integrated people search eases social networking
• Proof point for wider search roll-out in enterprise
“By using FAST Search Server 2010 for SharePoint, our researchers can refine their searches and find exactly what they are looking for. They spend more time innovating than looking for information.”
– Michelle Check, R&D Systems Leader, General Mills
27. Case study #2
Mississippi Department of Transportation (MDOT)
Business Problem
• Poor access to a large, active collection of paper-
based contracts and project documents
• Metadata managed in a separate DMS (database)
• Information silos stifle sharing of data and collaboration
• Requirements to provide internal and public access
Approach & Solution
• FAST Search for SharePoint indexes images with
iFilter-based OCR technology
• Pipeline extended with custom .NET code to merge
metadata from database with indexed documents
• Custom refiners reflect language used in the
business for navigating search results
Benefits & Value
• Unified self-service interface to locate information
• Ability to slice & dice results according to specific needs (dates, project, folder, route, district, etc.)
• Information search times cut from several hours or days to mere seconds or minutes
• Users have more time to focus on higher value tasks
“We are literally reducing decision cycles from days to minutes for hundreds of overlapping decisions a day. With SharePoint Server 2010, we can make better spending decisions and enhance program performance without a very large investment.”
– John Michael Simpson, CTO, MDOT
28. Ingredients for great enterprise search
The business value of FAST Search Server 2010 for SharePoint
The challenges
• Explosive content growth puts information management and
governance under pressure
• Multiple content silos with different search interfaces
• Poor metadata – missing, inconsistent, incorrect
The solution
• Content processing optimizes findability across disparate sources
• Property extraction generates metadata while indexing content
• Deep refiners expose metadata in search results helping users
quickly zoom to the right information
The benefits
• Reduced costs through enterprise search consolidation and
automated metadata enrichment
• Enhanced findability helps employees to get their job done faster
• Increased user adoption across the enterprise drives ROI