Search Intelligence & MarkLogic Search API

MarkLogic World 2012
Will Thompson
wthompson@jonesmcclure.com

Search API Resources
• 5-minute Guide to the Search API
• MarkLogic Search Developer's Guide
• developer.marklogic.com
• MarkMail.org
• MarkLogic Developer Listserv

Code
Github:
https://github.com/wthoolihan/MLUC-2012-Examples

Search Intelligence
• Get the most out of our XML in search
– Approach 1: GUI
– Approach 2: Syntax
– Approach 3: Facets, constraints, filters
– Infer (Search Intelligence)

Enrich Your Query!
• Infer
– Use knowledge about the user
– Look for meaning in search terms
• Enrich
– Translate into more complex query
– Gain speed, accuracy

Enrich Your Query!
• Strategies
– Custom term handling
• Works well for single term transformations
• See: http://developer.marklogic.com/try/ninja/page13
– Roll your own parser
• A lot of work (see Michael Blakeley’s xqysp)
– Work between parse and search steps

Search API Overview
• The Search API is an XQuery library module designed to simplify creating search applications:
– Parser
– Constraints
– Faceting
– Snippets
• High performance, scalability
• Extensible

Search API Extensibility
• Search API provides several points to hook in
• Hooks are defined in the Search API options XML node
– Custom constraints
– Custom grammar
– Custom snippets
– Custom term handling
– Search operators
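
As a rough illustration of where these hooks live, here is a sketch of an options node. The library namespace `http://example.com/my`, the module path `/lib/my-search.xqy`, and the function names `parse-cite`, `my-snippet`, and `my-term` are hypothetical placeholders, not part of the original deck. The module only builds the node: the custom functions would have to exist before it could be passed to search:search().

```xquery
xquery version "1.0-ml";

(: Sketch only: every apply/ns/at value below is a placeholder for your own library module. :)
let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <!-- custom constraint: cite:... is handed to a user-written parse function -->
    <constraint name="cite">
      <custom facet="false">
        <parse apply="parse-cite" ns="http://example.com/my" at="/lib/my-search.xqy"/>
      </custom>
    </constraint>
    <!-- custom snippets -->
    <transform-results apply="my-snippet" ns="http://example.com/my" at="/lib/my-search.xqy"/>
    <!-- custom term handling -->
    <term apply="my-term" ns="http://example.com/my" at="/lib/my-search.xqy"/>
    <!-- search operator: sort:relevance -->
    <operator name="sort">
      <state name="relevance">
        <sort-order direction="descending"><score/></sort-order>
      </state>
    </operator>
  </options>
return $options
```

A custom grammar would be configured in the same node through a `<grammar>` element.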

Search API Basics
• Search API module:
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";
• Main entry point: search:search()
– parses $qtext with given $options
– executes search
– returns <search:response>
• set of <search:result>s
• facets
• snippets
• metrics and other info
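
A minimal sketch of a search:search() call and of reading a few parts of the response. The query text and the choice of fields to print are illustrative assumptions.

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <return-metrics>true</return-metrics>
  </options>
(: parse the query text, run the search, and ask for results 1 to 10 :)
let $response := search:search("olympics", $options, 1, 10)
return (
  (: estimated number of matches :)
  string($response/@total),
  (: one line per result: document URI plus its snippet text :)
  for $result in $response/search:result
  return concat($result/@uri, " : ",
                string-join($result/search:snippet/search:match, " ... ")),
  (: timing information from the metrics section :)
  string($response/search:metrics/search:total-time)
)
```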

Search API Basics
• Search API options configure the hooks:
– Snippet
– Constraint
– Term handler
• Parser: a custom parser's output can be passed straight to search:resolve():
let $custom-parser-output := my:parse($qtext)
return
  search:resolve($custom-parser-output, $options)

Search API Basics
• Search API parser: search:parse()
– 1st half of search:search()
– returns annotated cts:query XML
• Execute search: search:resolve()
– 2nd half of search:search()
– accepts cts:query XML as input

search:parse() Strategy
1. Call search:parse()
2. Analyze and enrich the query XML
3. Call search:resolve()
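
The three steps can be sketched as follows. The enrichment shown here (restricting the parsed query to a collection inferred from what is known about the user) is a hypothetical example, and `local:enrich` and the collection name `news` are assumptions made for illustration.

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";
declare namespace cts = "http://marklogic.com/cts";

declare variable $qtext := "1996 United States Olympics";
declare variable $options := <options xmlns="http://marklogic.com/appservices/search"/>;

(: Hypothetical enrichment: AND the parsed query with a collection restriction. :)
declare function local:enrich($q as element(), $collection as xs:string) as element()
{
  <cts:and-query>{
    $q,
    <cts:collection-query><cts:uri>{ $collection }</cts:uri></cts:collection-query>
  }</cts:and-query>
};

(: 1. parse, 2. analyze and enrich the query XML, 3. resolve :)
let $parsed := search:parse($qtext, $options)
let $enriched := local:enrich($parsed, "news")
return search:resolve($enriched, $options, 1, 10)
```

Because search:resolve() accepts cts:query XML, anything that can be expressed in that XML form can be added between the two halves.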

Our Use Case
• O’Connor’s Online
– Search portal built on MarkLogic
– Legal rules and commentaries content
– Problem
• Users will enter citation numbers, abbreviations, etc. expecting complete results
• Text editorial content follows different conventions
– Solution
• Detect special cases pre-search and enrich query

Example: detect year
• Content:
– MarkLogic database of news/op-ed articles
• Organized into year directories:
/content/1990
/content/1991
/content/1992
...
/content/2012
• Year is in directory structure, not article text
– But users will still include year in search terms

How to transform query?
• Recursive typeswitch
(function mapping on):
do-stuff-here($q)

let $terms := "1996 United States Olympics"
return local:detect-year(search:parse($terms))

• Strategy depends on your content model
• Other possibilities
– date detection
– date ranges
– locations
– etc.
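
One possible shape for local:detect-year(), sketched under assumptions: a year is any four-digit token starting with 19 or 20, and directory URIs end in a trailing slash (for example /content/1996/). The original slide relied on function mapping for the recursion; this sketch uses an explicit loop so it does not depend on that setting.

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";
declare namespace cts = "http://marklogic.com/cts";

declare function local:detect-year($q as node()) as node()*
{
  typeswitch ($q)
    case element(cts:word-query) return
      let $term := string-join($q/cts:text, " ")
      return
        if (matches($term, "^(19|20)\d\d$"))
        then
          (: swap the year term for a directory restriction :)
          <cts:directory-query depth="infinity">
            <cts:uri>{ concat("/content/", $term, "/") }</cts:uri>
          </cts:directory-query>
        else $q
    case element() return
      (: copy any other element and recurse into its children :)
      element { node-name($q) } {
        $q/@*,
        for $n in $q/node() return local:detect-year($n)
      }
    default return $q
};

let $terms := "1996 United States Olympics"
return local:detect-year(search:parse($terms))
```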

search:parse() Strategy
• Weakness
– Limited to single word token
• Similar to custom term handling
• What about multiple tokens?
– Analyze querystring text directly using regex
• Dangerous
– Transform cts:query XML into intermediate form
• Preserve Boolean logic & grouping
• Preserve phrases
• Preserve constraints

Building Intermediate Query
• The hack
– Basically, undoing some of the parser's work
– Text "run" concept
• Similar to WordprocessingML

Building Intermediate Query
• Intermediate query strategy
1. Flatten query
2. Join sibling words in <run>
3. Transform <run>s
4. Convert <run>s back to word queries

Example: multi-word thesaurus
• Content:
– Same MarkLogic database of news/op-ed articles from detect-year() example
• Query:
– Same as before: "1996 United States Olympics"
– Start with the search:parse() output

1. Flatten query
– remove implicit and-queries from search:parse() output:

1. Flatten query
– XML should look more like cts:query string representation:
cts:and-query(
(cts:word-query("1996", "lang=en", 1),
cts:word-query("United", "lang=en", 1),
cts:word-query("States", "lang=en", 1),
cts:word-query("Olympics", "lang=en", 1)),
())

1. Flatten query
• Typeswitch on cts:and-query:
1. Check and-queries for parent and-query
2. Remove the nested ones, copy through anything else
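
A sketch of what local:unnest-ands() might look like, following the two steps above. This is an illustration, not the code from the original slide. It checks the parent axis on the original tree, so a nested and-query is replaced by its (recursively processed) children.

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";
declare namespace cts = "http://marklogic.com/cts";

declare function local:unnest-ands($q as node()) as node()*
{
  typeswitch ($q)
    case element(cts:and-query) return
      let $children := for $n in $q/node() return local:unnest-ands($n)
      return
        if ($q/parent::cts:and-query)
        then $children                                      (: nested: keep only the children :)
        else element { node-name($q) } { $q/@*, $children } (: outermost: keep the wrapper :)
    case element() return
      (: copy through anything else :)
      element { node-name($q) } {
        $q/@*,
        for $n in $q/node() return local:unnest-ands($n)
      }
    default return $q
};

search:parse("1996 United States Olympics")/local:unnest-ands(.)
```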

1. Flatten query
– Typeswitch function output:

2. Join sibling words in <run>:
• Typeswitch on cts:word-query:
1. Ignore phrases
2. Delete if query is not the first
3. Take first word-query in sequence and join with its following siblings into a <run>
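
The three rules above could be implemented along these lines. The helper `local:is-word` and the test for a phrase (a word-query whose text contains whitespace) are assumptions made for this sketch. The input is a hand-written, already flattened query so the module stands on its own.

```xquery
xquery version "1.0-ml";
declare namespace cts = "http://marklogic.com/cts";

(: True for a single-token word-query; phrases (text containing whitespace) do not count. :)
declare function local:is-word($n as node()?) as xs:boolean
{
  $n instance of element(cts:word-query)
    and (every $t in $n/cts:text satisfies not(matches($t, "\s")))
};

declare function local:create-runs($q as node()) as node()*
{
  typeswitch ($q)
    case element(cts:word-query) return
      if (not(local:is-word($q))) then $q                          (: 1. ignore phrases :)
      else if (local:is-word($q/preceding-sibling::*[1])) then ()  (: 2. not first: already joined :)
      else
        (: 3. join with the contiguous following word-queries :)
        let $stop := ($q/following-sibling::*[not(local:is-word(.))])[1]
        let $rest :=
          if ($stop) then $q/following-sibling::*[. << $stop]
          else $q/following-sibling::*
        return <run>{ string-join(($q, $rest)/cts:text, " ") }</run>
    case element() return
      element { node-name($q) } {
        $q/@*,
        for $n in $q/node() return local:create-runs($n)
      }
    default return $q
};

local:create-runs(
  <cts:and-query>
    <cts:word-query><cts:text>1996</cts:text></cts:word-query>
    <cts:word-query><cts:text>United</cts:text></cts:word-query>
    <cts:word-query><cts:text>States</cts:text></cts:word-query>
    <cts:word-query><cts:text>Olympics</cts:text></cts:word-query>
  </cts:and-query>
)
```

For this input the four word-queries collapse into one `<run>` holding "1996 United States Olympics" inside the and-query. Note that sibling words under any parent are joined, including words inside an or-query, so later transforms should take the parent into account.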

• Input:
– search:parse("1996 United States Olympics")/local:unnest-ands(.)/local:create-runs(.)
• Output:

• Input:
– search:parse("1996 (sprint OR marathon) United States Olympics")/local:unnest-ands(.)/local:create-runs(.)
• Output:

3. Transform <run>s:
1. Store terms in thesaurus
2. Build cts:or-query of thesaurus terms
3. Using cts:or-query of terms, cts:highlight() <run>s, and replace with thesaurus synonyms

1. Store terms in thesaurus
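
A minimal sketch of loading a thesaurus document at the URI used later in the deck (thesaurus.xml). The entry and its synonyms are made-up sample data.

```xquery
xquery version "1.0-ml";
declare namespace thsr = "http://marklogic.com/xdmp/thesaurus";

(: Sample data only: one multi-word entry with two synonyms. :)
xdmp:document-insert(
  "thesaurus.xml",
  <thsr:thesaurus>
    <thsr:entry>
      <thsr:term>United States</thsr:term>
      <thsr:synonym><thsr:term>USA</thsr:term></thsr:synonym>
      <thsr:synonym><thsr:term>America</thsr:term></thsr:synonym>
    </thsr:entry>
  </thsr:thesaurus>
)
```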

2. build cts:or-query of thesaurus terms:

3. replace matches with synonyms:
– cts:highlight() - powerful cts:query-based find/replace

3. replace matches with synonyms:
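
A sketch of how local:thsr-expand() could use cts:highlight() to do the replacement. It assumes a thesaurus document at thesaurus.xml in the MarkLogic thesaurus schema; the helper `local:synonyms` and the hand-written `<run>` input are illustrative assumptions.

```xquery
xquery version "1.0-ml";
declare namespace cts = "http://marklogic.com/cts";
declare namespace thsr = "http://marklogic.com/xdmp/thesaurus";

(: Hypothetical helper: synonyms of a term, compared case-insensitively. :)
declare function local:synonyms($term as xs:string) as xs:string*
{
  doc("thesaurus.xml")
    //thsr:entry[lower-case(thsr:term) eq lower-case($term)]
    /thsr:synonym/thsr:term/string(.)
};

declare function local:thsr-expand($q as node(), $q-thsr as cts:query) as node()*
{
  typeswitch ($q)
    case element(run) return
      (: each thesaurus match inside the run becomes an or-query of term + synonyms :)
      cts:highlight($q, $q-thsr,
        <cts:or-query>{
          for $t in ($cts:text, local:synonyms($cts:text))
          return <cts:word-query><cts:text>{ $t }</cts:text></cts:word-query>
        }</cts:or-query>)
    case element() return
      element { node-name($q) } {
        $q/@*,
        for $n in $q/node() return local:thsr-expand($n, $q-thsr)
      }
    default return $q
};

let $q-thsr :=
  cts:or-query(
    doc("thesaurus.xml")//thsr:entry/thsr:term/cts:word-query(string(.))
  )
return
  local:thsr-expand(
    <cts:and-query><run>1996 United States Olympics</run></cts:and-query>,
    $q-thsr)
```

Text in the run that does not match a thesaurus term is left in place as plain text.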

Input:
let $q-thsr :=
  cts:or-query(
    doc("thesaurus.xml")
      //thsr:entry/thsr:term/cts:word-query(string(.))
  )
let $runs :=
  search:parse("1996 United States Olympics")
    /local:unnest-ands(.)/local:create-runs(.)
return local:thsr-expand($runs, $q-thsr)

Output:

4. Convert <run>s back to word queries:
– Typeswitch:
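
One way local:resolve-runs() could be written, as a sketch: plain text left in a `<run>` is split on whitespace into word-queries, and any query elements already inserted into the run are copied through. The sample input is hand-written.

```xquery
xquery version "1.0-ml";
declare namespace cts = "http://marklogic.com/cts";

declare function local:resolve-runs($q as node()) as node()*
{
  typeswitch ($q)
    case element(run) return
      (: drop the <run> wrapper and rebuild its contents :)
      for $n in $q/node()
      return
        typeswitch ($n)
          case text() return
            for $w in tokenize(normalize-space($n), " ")[. ne ""]
            return <cts:word-query><cts:text>{ $w }</cts:text></cts:word-query>
          default return local:resolve-runs($n)
    case element() return
      element { node-name($q) } {
        $q/@*,
        for $n in $q/node() return local:resolve-runs($n)
      }
    default return $q
};

local:resolve-runs(
  <cts:and-query>
    <run>1996 <cts:or-query><cts:word-query><cts:text>United States</cts:text></cts:word-query><cts:word-query><cts:text>USA</cts:text></cts:word-query></cts:or-query> Olympics</run>
  </cts:and-query>
)
```

The rebuilt word-queries do not carry the annotation attributes that search:parse() put on the originals.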

Input:
let $q-thsr :=
  cts:or-query(
    doc("thesaurus.xml")
      //thsr:entry/thsr:term/cts:word-query(string(.))
  )
let $runs :=
  search:parse("1996 United States Olympics")
    /local:unnest-ands(.)/local:create-runs(.)
let $expanded := local:thsr-expand($runs, $q-thsr)
return local:resolve-runs($expanded)

Output:

Combining Examples
local:thsr-expand($runs, $q-thsr)
  /local:resolve-runs(.)/local:detect-year(.)

Enrich Your Query!
• Takeaway
1. No added GUI
2. Didn't ask the user for additional input
3. Able to build more robust query before
executing search

Search API Hacking
• Many potential applications:
– Ad-hoc weighting:
local:q-add-weights(
  search:parse("bananas"),
  (<element ns="{$ns}" name="p" weight="1"/>,
   <element ns="{$ns}" name="b" weight="2"/>,
   <element ns="{$ns}" name="title" weight="3.5"/>)
)
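
A sketch of how local:q-add-weights() might work: each word-query is replaced by an or-query of weighted element-word-queries, one per configured element. The value of `$ns` (the XHTML namespace) is an assumption. The queries are built with the cts: constructors and serialized back to XML by placing them inside an element constructor.

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";
declare namespace cts = "http://marklogic.com/cts";

(: Assumed content namespace, for illustration only. :)
declare variable $ns := "http://www.w3.org/1999/xhtml";

declare function local:q-add-weights(
  $q as node(),
  $elements as element(element)*
) as node()*
{
  typeswitch ($q)
    case element(cts:word-query) return
      (: build the weighted queries, then serialize them to cts:query XML :)
      <x>{
        cts:or-query(
          for $e in $elements
          return
            cts:element-word-query(
              QName(string($e/@ns), string($e/@name)),
              $q/cts:text/string(.),
              (),
              xs:double($e/@weight))
        )
      }</x>/*
    case element() return
      element { node-name($q) } {
        $q/@*,
        for $n in $q/node() return local:q-add-weights($n, $elements)
      }
    default return $q
};

local:q-add-weights(
  search:parse("bananas"),
  (<element ns="{$ns}" name="p" weight="1"/>,
   <element ns="{$ns}" name="b" weight="2"/>,
   <element ns="{$ns}" name="title" weight="3.5"/>)
)
```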

Search API Hacking
– Automatic spell correction:
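
A possible sketch of spell correction as a query transformation, using the MarkLogic spelling module. The dictionary URI is hypothetical, and keeping the original term alongside up to three suggestions is a design choice made for this example.

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";
import module namespace spell = "http://marklogic.com/xdmp/spell"
  at "/MarkLogic/spell.xqy";
declare namespace cts = "http://marklogic.com/cts";

(: Hypothetical dictionary URI; a spelling dictionary must already be loaded there. :)
declare variable $dictionary := "/dictionaries/content-words.xml";

declare function local:spell-correct($q as node()) as node()*
{
  typeswitch ($q)
    case element(cts:word-query) return
      let $word := string($q/cts:text[1])
      return
        (: leave phrases and correctly spelled words alone :)
        if (matches($word, "\s") or spell:is-correct($dictionary, $word))
        then $q
        else
          (: keep the original term and OR it with the top suggestions :)
          <cts:or-query>{
            $q,
            for $s in spell:suggest($dictionary, $word)[1 to 3]
            return <cts:word-query><cts:text>{ $s }</cts:text></cts:word-query>
          }</cts:or-query>
    case element() return
      element { node-name($q) } {
        $q/@*,
        for $n in $q/node() return local:spell-correct($n)
      }
    default return $q
};

local:spell-correct(search:parse("1996 United States Olypmics"))
```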

Search API Hacking
– Detect entities
• Transform text into element-based query
• Fewer false positives and exclusions
• Leverage indexes:
"New York Times"

Search API Hacking
• Other ideas
– Regex unparsed query string
• apply constraints, operators, etc. as configured in Search API based on keywords/patterns
– Custom term handler
• single-term transformations
– Combine with data enrichment on ingestion
• MarkLogic Entity Framework
• Linguistic processing

Hazards
• Chaos
– Daisy-chained transformations can have unintended consequences
• Performance
– Pre-search transformations need to be fast
– Make sure to leverage indexes as much as possible
– Larger queries do take longer
