2. Who am I?
3 years at Cognifide – exactly today
Senior software engineer & technical lead
Focused on systems integration tasks
The “search guy” at Cognifide
3. What we won’t talk about
• Sorting
• Document structure
• Indexing
• Managed relevancy model
• Input data processing
• Highlighter
• Faceted search
• Wildcard search
• Statistics
• Autocomplete
• Spellchecking
• Lemmatization
• Sentence search
• Pagination
• Content normalization
• Metadata
• Data collections & views
5. The goal of searching
“What is the best British football team?”
If we ask such a question, will the search engine find the answer?
6. The goal of searching
“What is the best British football team?”
The search engine will find the question, not the answer.
7. The goal of searching
“What is the best British football team?”
vs.
“best team football UK”
Are we asking questions or issuing queries?
8. The goal of searching
Effective searching is about finding keywords:
• in the shortest possible time
• close to each other in a block of text
• that are in a desired context
and being sure the engine knows about the data we are looking for!
11. Microsoft FAST
The first major external search integration with AEM (then: CQ 5.4) at Cognifide.
Push-style indexing using the CQ-FAST connector from Adobe.
12. Microsoft FAST
Implemented as a dedicated replication agent, triggered by content replication.
http://wem.help.adobe.com/enterprise/en_US/10-0/wem/administering/cq2fast.html
17. Microsoft FAST
The agent sends content to MS FAST.
The “cq5” suffix in the URI names a document collection: a named subset of the documents in the entire FAST index.
http://wem.help.adobe.com/enterprise/en_US/10-0/wem/administering/cq2fast.html
19. Microsoft FAST
The replication agent works well for a single site stored in a single FAST document collection.
It becomes complicated in a multi-site environment, where each site must live in a separate index area and search results must not contain data from other sites.
21. Microsoft FAST
A complex ACL configuration was used to ensure that exactly one agent delivers each document to FAST.
It was hard to set up and maintain without proper tools automating the whole process.
23. Google Search Appliance
For the AEM & GSA integration, we considered reusing the CQ-FAST connector approach.
Aware of its issues, we decided instead to develop our own micro-framework that takes care of the indexing process.
It is installed as a single OSGi bundle.
It provides a set of services and utilities to help with the indexing.
24. Google Search Appliance
The indexing process spans the author and the publish AEM instances.
All stages are tracked, and it is possible to recover from a failure and retry the indexing.
[Diagram: Content replication -> Filtering -> Push to Publish -> Indexing queue(s) -> Content gathering -> Metadata processing -> Push to external engine. Instances: Author, Publish. Below all stages: Process status tracking & persistence.]
25. Google Search Appliance
The process starts with content replication, or programmatically from the backend, e.g. triggered by a scheduler service.
26. Google Search Appliance
Each replicated content path is filtered against a whitelist and a blacklist.
Optionally, a custom OSGi service can decide whether the content should be indexed, removed or ignored.
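The filtering stage could be sketched roughly as below. This is a minimal illustration, not the framework's actual code: the class IndexingFilter, the CustomDecider interface (standing in for the custom OSGi service), the regex-based lists and the "deactivated" flag are all assumptions made for the example.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical sketch of the filtering stage; none of these names come from the original framework. */
final class IndexingFilter {

    /** The three outcomes named on the slide. */
    enum Decision { INDEX, REMOVE, IGNORE }

    /** Stand-in for the optional custom OSGi service; an empty result means "no opinion". */
    interface CustomDecider {
        Optional<Decision> decide(String path);
    }

    private final List<Pattern> whitelist;
    private final List<Pattern> blacklist;
    private final CustomDecider customDecider;

    IndexingFilter(List<String> whitelist, List<String> blacklist, CustomDecider customDecider) {
        this.whitelist = compile(whitelist);
        this.blacklist = compile(blacklist);
        this.customDecider = customDecider;
    }

    private static List<Pattern> compile(List<String> regexes) {
        return regexes.stream().map(Pattern::compile).collect(Collectors.toList());
    }

    private static boolean matchesAny(List<Pattern> patterns, String path) {
        return patterns.stream().anyMatch(p -> p.matcher(path).matches());
    }

    /**
     * Decides what to do with a replicated path.
     * Assumption: "deactivated" is true when the replication event takes the content offline,
     * which is treated here as the signal to remove it from the index.
     */
    Decision decide(String path, boolean deactivated) {
        // The custom decider, if it has an opinion, overrides the lists.
        Optional<Decision> custom = customDecider.decide(path);
        if (custom.isPresent()) {
            return custom.get();
        }
        // A path must be whitelisted and must not be blacklisted.
        if (matchesAny(blacklist, path) || !matchesAny(whitelist, path)) {
            return Decision.IGNORE;
        }
        return deactivated ? Decision.REMOVE : Decision.INDEX;
    }
}
```

In this sketch the blacklist wins over the whitelist, and a decider returning Optional.empty() leaves the decision to the lists; the real framework may order these checks differently.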
27. Google Search Appliance
The indexing information is persisted in a special kind of repository node and replicated to the publish instance.
We can choose which publish instance(s) will receive the data.
28. Google Search Appliance
The information is received and instantly dispatched to the indexing queue(s).
We can handle indexing in a single search engine or in several different ones.
29. Google Search Appliance
The content is gathered using the SlingRequestProcessor OSGi service.
It works like a request for an HTML page that is sent from the Java code and consumed by the same code.
30. Google Search Appliance
Metadata is collected according to multiple different rules:
• the content resource type
• the content path
• values of the component properties
• custom rules
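One way to picture rule-based metadata collection is a list of small rules, each of which may add entries to a metadata map. The sketch below is only an illustration under assumptions: MetadataCollector, Page, Rule and the three factory methods are hypothetical names, and a plain Map stands in for the component properties.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of rule-based metadata collection; all names are illustrative. */
final class MetadataCollector {

    /** Minimal stand-in for a page: its path, resource type and component properties. */
    static final class Page {
        final String path;
        final String resourceType;
        final Map<String, String> properties;

        Page(String path, String resourceType, Map<String, String> properties) {
            this.path = path;
            this.resourceType = resourceType;
            this.properties = properties;
        }
    }

    /** A rule inspects the page and may add entries to the metadata map. */
    interface Rule {
        void apply(Page page, Map<String, String> metadata);
    }

    /** Rule keyed on the content resource type. */
    static Rule forResourceType(String resourceType, String key, String value) {
        return (page, metadata) -> {
            if (resourceType.equals(page.resourceType)) {
                metadata.put(key, value);
            }
        };
    }

    /** Rule keyed on the content path. */
    static Rule forPathPrefix(String prefix, String key, String value) {
        return (page, metadata) -> {
            if (page.path.startsWith(prefix)) {
                metadata.put(key, value);
            }
        };
    }

    /** Rule that copies the value of a component property, if present. */
    static Rule copyProperty(String property, String key) {
        return (page, metadata) -> {
            String value = page.properties.get(property);
            if (value != null) {
                metadata.put(key, value);
            }
        };
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Registers a rule; a custom rule is simply any other Rule implementation. */
    MetadataCollector add(Rule rule) {
        rules.add(rule);
        return this;
    }

    /** Applies all rules in registration order and returns the collected metadata. */
    Map<String, String> collect(Page page) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (Rule rule : rules) {
            rule.apply(page, metadata);
        }
        return metadata;
    }
}
```

Rules run in registration order, so a later rule can overwrite a value set by an earlier one; whether the original framework resolves conflicts this way is not stated on the slide.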
31. Google Search Appliance
The content and metadata are combined and sent to the search engine.
Depending on the implementation, this can be done for each single document or in batches.
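The difference between per-document and batched pushing can be sketched as a small buffer in front of the engine client. This is an illustration under assumptions: BatchingPusher and EngineClient are hypothetical names, documents are modelled as plain maps, and the "id" and "content" field names are invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the final push stage; all names are illustrative. */
final class BatchingPusher {

    /** Stand-in for the engine-specific code that actually talks to the search engine. */
    interface EngineClient {
        void push(List<Map<String, String>> documents);
    }

    private final EngineClient client;
    private final int batchSize;
    private final List<Map<String, String>> buffer = new ArrayList<>();

    /** A batchSize of 1 pushes every document on its own. */
    BatchingPusher(EngineClient client, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        this.client = client;
        this.batchSize = batchSize;
    }

    /** Combines content and metadata into one document and pushes when the batch is full. */
    void add(String id, String content, Map<String, String> metadata) {
        Map<String, String> document = new LinkedHashMap<>(metadata);
        document.put("id", id);
        document.put("content", content);
        buffer.add(document);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Pushes whatever is buffered, e.g. at the end of an indexing run. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        client.push(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

With batchSize set to 1 every call to add results in one push; larger values trade latency for fewer requests to the engine.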
32. Google Search Appliance
In case of any failure, indexing is rescheduled and launched again as many times as configured.
If the server goes down, indexing will restart when the machine is up again.
[Diagram: Content replication -> Filtering -> Push to Publish -> Indexing queue(s) -> Content gathering -> Metadata processing -> Failure or timeout -> Retry. Instances: Author, Publish. Below all stages: Process status tracking & persistence.]
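The "retry as many times as configured" part can be sketched as a bounded retry loop. This is a minimal illustration with hypothetical names (RetryingIndexer, IndexTask); it retries immediately and keeps no persistent state, so it does not cover the restart-after-server-outage behaviour, which relies on the persisted process status.

```java
/** Hypothetical sketch of bounded retries; names are illustrative, not the framework's API. */
final class RetryingIndexer {

    /** A single indexing attempt; any exception counts as a failure or timeout. */
    interface IndexTask {
        void run() throws Exception;
    }

    /**
     * Runs the task until it succeeds or maxAttempts is reached.
     *
     * @return the number of the attempt that succeeded (1-based)
     * @throws IllegalStateException if every attempt failed; the last failure is the cause
     */
    static int run(IndexTask task, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                task.run();
                return attempt;
            } catch (Exception e) {
                // Remember the failure and try again on the next iteration.
                lastFailure = e;
            }
        }
        throw new IllegalStateException(
                "Indexing failed after " + maxAttempts + " attempts", lastFailure);
    }
}
```

A production version would typically wait between attempts and record each attempt in the persisted process status; both are left out here to keep the sketch short.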
35. Apache Solr
The search engine, which is:
• free & open source
• powerful
• customizable
• scalable
Most importantly, Solr integration is part of Jackrabbit Oak, the repository engine used by AEM 6.
AEM with integrated Solr is right there.
36. Apache Solr
The solution developed for GSA has been ported to work with Solr.
Changes:
• Replaced the “glue code” that does the final data push with code that uses the SolrJ Java library.
• Names of the document metadata fields have been changed to follow the Solr naming convention for dynamic fields.
Everything else remained untouched.
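As an illustration of the field renaming, the sketch below appends a type suffix to a metadata name. The suffixes (_s, _i, _l, _d, _b, _dt) follow the dynamic field patterns found in Solr's example schemas; they only work if the target schema declares matching dynamicField rules. The class SolrFieldNames and the type-to-suffix mapping are assumptions, not the code used in the project.

```java
import java.util.Date;

/** Hypothetical helper mapping metadata names to Solr dynamic field names. */
final class SolrFieldNames {

    private SolrFieldNames() {
    }

    /**
     * Appends a suffix based on the Java type of the value.
     * Assumption: the schema declares *_b, *_i, *_l, *_d, *_dt and *_s dynamic fields.
     */
    static String dynamicFieldName(String name, Object value) {
        if (value instanceof Boolean) {
            return name + "_b";
        }
        if (value instanceof Integer) {
            return name + "_i";
        }
        if (value instanceof Long) {
            return name + "_l";
        }
        if (value instanceof Double) {
            return name + "_d";
        }
        if (value instanceof Date) {
            return name + "_dt";
        }
        // Everything else is indexed as a plain string.
        return name + "_s";
    }
}
```

For example, a "title" metadata value holding a string would be sent as "title_s", and an integer "department" value as "department_i".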
38. Search driven components
No server-side processing.
Search engine used as a mini database of metadata.
Configuration via query parameters.
Pure front-end implementation.
39. Search driven components
The whole page can be read from the dispatcher cache.
An AJAX request gets the content directly from the search engine.
The response is structured as JSON, which is easy to parse and display using JavaScript.
{
"id": "223344",
"firstName": "Michael",
"lastName": "Johnson",
"phone": "(123)-777-8888",
"office": "Office UK",
"department": "504",
"title": "Lead Architect"
}
41. Search driven components
User profile.
The name, mobile,
email, image path etc.
are all metadata values
of the document.
42. Search driven components
Carousel with news.
By changing the
maximum number
of search results,
we can control the
number of slides in
the carousel.
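Controlling the carousel through the number of results comes down to one query parameter. The sketch below builds a Solr select URL using the standard q, rows and wt parameters; the component itself would build the same URL in JavaScript, and Java is used here purely for illustration. The class name, the base URL and the "type_s:news" query in the usage note are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of the query a search-driven component might issue. */
final class SearchComponentQuery {

    private SearchComponentQuery() {
    }

    /**
     * Builds a Solr select URL.
     *
     * @param baseUrl    URL of the Solr core or collection (assumed, e.g. http://host:8983/solr/news)
     * @param query      the Solr query string, URL-encoded by this method
     * @param maxResults value of the "rows" parameter, i.e. the number of slides
     */
    static String selectUrl(String baseUrl, String query, int maxResults) {
        return baseUrl + "/select"
                + "?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&rows=" + maxResults
                + "&wt=json";
    }
}
```

Calling selectUrl("http://localhost:8983/solr/news", "type_s:news", 5) would request at most five news documents as JSON; changing the last argument changes the number of slides without touching any server-side code.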