Presented by Engy Ali | The Library of Alexandria See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
Do you have a large collection of text content that you want to search? Facing challenges on how to facet after performing a full text search across metadata and content? Do you want to use Solr with personalization? Bibliotheca Alexandrina provides public access to digitized book collections that exceed 220,000 books, through a web-based search and browsing facility. The facility is completely built on Solr in five different languages. The website provides full text morphological search within the books’ metadata and content with result highlighting. Different personalization features like annotation tools and tagging are also implemented using Solr. This presentation will cover how Bibliotheca Alexandrina uses Solr to implement full text indexing and searching across the entire collection, faceting, search within the content of a book and result highlighting and techniques used for personalization.
How to Access Your Library Book Collections Using Solr
1. Accessing Your Library Book Collections Using Solr
By: Engy Morsy
Software project manager, Bibliotheca Alexandrina
engy.morsy@bibalex.org
5/14/12
http://dar.bibalex.org
12. Book site
• Approximately 260,000 books
• Nearly 220,000 books published online
• About 1.5 TB of content
• Average book size: 6 MB
• Daily indexing rate is about 150 books
13. What do we want…?
• Allow simple and advanced search across metadata and content in 5 languages
15. What do we want…?
• Allow simple and advanced search across metadata and content in 5 languages
• Faceting
20. What do we want…?
• Allow simple and advanced search across metadata and content in 5 languages
• Faceting
• Annotations
25. What do we want…?
• Allow simple and advanced search across metadata and content in 5 languages
• Faceting
• Annotations
• Personalization
32. Book site indices
[Diagram: a query is sent to five per-language indexes: AR, EN, FR, IT, SP]
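The per-language layout on this slide can be pictured as a small routing table. The sketch below is only an illustration: the base URL and the core names are hypothetical, and only the five language codes come from the slide. It builds the request URL for a standard Solr `/select` handler without contacting any server.

```python
from urllib.parse import urlencode

# Hypothetical base URL and core names; only the language codes are from the slide.
SOLR_BASE = "http://localhost:8983/solr"
LANGUAGE_CORES = {
    "AR": "books_ar",
    "EN": "books_en",
    "FR": "books_fr",
    "IT": "books_it",
    "SP": "books_sp",
}

def select_url(language, query, rows=10):
    """Build the /select URL for the index that holds the given language."""
    core = LANGUAGE_CORES[language.upper()]
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{SOLR_BASE}/{core}/select?{params}"

def fan_out(query):
    """Return one request URL per language index for the same query."""
    return {lang: select_url(lang, query) for lang in LANGUAGE_CORES}
```

Calling `fan_out("cloud")` gives five URLs, one per language index; a real deployment would then issue the requests and combine the responses.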
33. Indexing Book Collection
• Index per language
• A document in the content index corresponds to a page in a book
• Maintain a field to distinguish between metadata records and content records (e.g. SolrType)
• Use static fields for all content index records (e.g. PageID, etc.)
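A minimal sketch of the record layout described on this slide: one metadata record per book and one content record per page, told apart by a type field. `SolrType` and `PageID` are named on the slide, and the values `Meta` and `Content` appear in the later diagrams; `ID`, `BookID`, `Title`, `Language` and `Text` are assumed names used only for illustration.

```python
def metadata_record(book_id, title, language):
    """One metadata record per book (field names other than SolrType are assumptions)."""
    return {
        "ID": book_id,
        "SolrType": "Meta",
        "BookID": book_id,
        "Title": title,
        "Language": language,
    }

def content_records(book_id, pages):
    """One content record per page; PageID is the 1-based page number."""
    return [
        {
            "ID": f"{book_id}_{page_number}",
            "SolrType": "Content",
            "BookID": book_id,
            "PageID": page_number,
            "Text": text,
        }
        for page_number, text in enumerate(pages, start=1)
    ]
```

With this layout, a filter on `SolrType` selects either the book-level or the page-level records inside the same index.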
34. What is the problem with this solution?
35. Problem for content search
Example: Advanced Search
Search for Title: Mobile Technology and Content: “cloud computing”
36. Proposed solution
[Diagram: the query Title: Mobile Technology runs against the index with SolrType Meta and returns result IDs; the query Content: “cloud computing” runs against the index with SolrType Content and returns parent book IDs; “Get intersection” combines the two ID sets into the final result and the facet result]
37. The problem is…
• Can’t get the faceting result directly from the content index
• Need to query the metadata index in order to get the final facet result
Processing time!!!
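The extra round trip described here can be sketched as follows, assuming the metadata query returns book IDs in ranked order and the content query returns page records carrying a hypothetical `BookID` field. The intersection is computed on the client, and a second filter against the metadata records is then needed before facets can be counted.

```python
def intersect_book_ids(meta_hits, content_hits):
    """Keep metadata hits whose book also has at least one matching page.

    meta_hits: book IDs from the metadata query, in ranked order.
    content_hits: page records from the content query, each with a BookID.
    """
    parent_ids = {page["BookID"] for page in content_hits}
    return [book_id for book_id in meta_hits if book_id in parent_ids]

def facet_filter(book_ids):
    """Filter query for the second metadata request that produces the facets.

    Returns None when nothing matched, so the caller can skip the request.
    """
    if not book_ids:
        return None
    return "BookID:(" + " OR ".join(book_ids) + ")"
```

The filter grows with the number of matching books, which is one way to see why this approach costs processing time.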
38. Solution…!
• Metadata denormalization
– Denormalize metadata into content index
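Denormalization here means copying the book-level metadata onto every page-level record, so that a content query can return facet fields directly. A minimal sketch, assuming hypothetical field names (`BookID`, `Title`, `Category`, `Text`) alongside the `SolrType` and `PageID` fields mentioned in the slides:

```python
def denormalize(book_meta, pages):
    """Copy the book's metadata fields onto each page record."""
    docs = []
    for page_number, text in enumerate(pages, start=1):
        doc = dict(book_meta)  # shallow copy so pages do not share one dict
        doc.update({"SolrType": "Content", "PageID": page_number, "Text": text})
        docs.append(doc)
    return docs
```

Every page record now repeats the same metadata, so a change to one metadata value touches as many documents as the book has pages.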
39. Proposed solution
[Diagram: the query Title: Mobile Technology (SolrType Meta) returns result IDs; the query Content: “cloud computing” (SolrType Content) returns the facet result; “Get intersection” combines them into the final result]
40. Problem for content search
• Metadata denormalization….. Worst choice!
• Re-indexing for changes in metadata
• Data processing is required.
42. Indexing Metadata
• Index per language
• Separate content and metadata index
• Text field holds the whole book content in the metadata index
– The maxFieldLength has been set to the maximum, e.g. 2147483647
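A minimal sketch of a metadata record that carries the whole book text in one field, as this slide describes. The `Text` field name follows the slide; `BookID` and `Title` are assumed names. The value 2147483647 equals 2^31 - 1, the largest 32-bit signed integer, which in effect removes the limit on how many tokens of a field are indexed. The whitespace token count is only a rough proxy for what a real analyzer would produce.

```python
JAVA_INT_MAX = 2**31 - 1  # 2147483647, the value quoted on the slide

def metadata_document(book_id, title, pages):
    """Join all pages into a single Text field on the book-level record."""
    return {"BookID": book_id, "Title": title, "Text": "\n".join(pages)}

def rough_token_count(doc):
    """Whitespace-separated token count; a real analyzer may count differently."""
    return len(doc["Text"].split())

def fits_limit(doc, max_field_length=JAVA_INT_MAX):
    """True if the rough token count stays within the configured limit."""
    return rough_token_count(doc) <= max_field_length
```

With a small limit such as 10,000, `fits_limit` would report that long books get truncated at indexing time, which is the reason for raising the setting.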
43. Back to the example
Example: Advanced Search
Search for Title: Mobile Technology and Content: “cloud computing”
44. Solution
[Diagram: both conditions, Title: Mobile Technology and Content: “cloud computing”, are sent to the Meta index, which returns the facet result]
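With the book text held in the metadata index, the advanced search from the example can be expressed as one request that also asks for facets. The sketch below only builds the request parameters. `q`, `facet`, `facet.field`, `facet.mincount` and `wt` are standard Solr parameters; the field names `Title`, `Text`, `Category` and `Language` are assumptions, and no query escaping is attempted.

```python
from urllib.parse import urlencode

def advanced_search_params(title, content_phrase, facet_fields):
    """Parameters for a single metadata-index query with faceting enabled."""
    query = f'Title:({title}) AND Text:"{content_phrase}"'
    params = [("q", query), ("facet", "true"), ("facet.mincount", "1")]
    params += [("facet.field", field) for field in facet_fields]
    params.append(("wt", "json"))
    return params

def to_query_string(params):
    """URL-encode the parameter list; repeated keys such as facet.field are kept."""
    return urlencode(params)
```

For the example on the slides, `advanced_search_params("Mobile Technology", "cloud computing", ["Category", "Language"])` yields one query and two facet fields, with no client-side intersection step.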
45. Solution
[Diagram: Title: Mobile Technology is sent to the Meta index and Content: “cloud computing” to the Content index; “Get intersection” combines the results, and the Meta index returns the facet result]
46. Separate indexes vs. all in one
• Separate indexes
+ Indexing time
+ Index size
- Processing results (facets, etc.)
- Scoring
47. Separate indexes vs. all in one
• Separate indexes
+ Indexing time
+ Index size
- Processing results (facets, etc.)
- Scoring
• One index
- Index size
- Indexing time
+ Scoring
+ Processing time
48. Book content index
[Diagram: five per-language indexes: AR, EN, FR, IT, SP]
50. Searching
• Simple and advanced search
– Cache the resulting IDs only
• Highlighting search results
– Get the full search result and highlight per page of results
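Caching only the resulting IDs can be sketched as a small least-recently-used cache keyed by the query string. This is an illustration under assumptions, not the site's implementation: the class name, the eviction policy and the page size are all hypothetical. Full records and highlights would then be fetched only for the IDs on the page being displayed.

```python
from collections import OrderedDict

class ResultIdCache:
    """Keeps the ordered list of matching IDs per query, never the full records."""

    def __init__(self, max_queries=100):
        self._entries = OrderedDict()
        self._max_queries = max_queries

    def put(self, query, ids):
        self._entries[query] = list(ids)
        self._entries.move_to_end(query)
        # Evict the least recently used queries once the cache is full.
        while len(self._entries) > self._max_queries:
            self._entries.popitem(last=False)

    def page(self, query, page_number, page_size=10):
        """Return the IDs for a 1-based result page, or None on a cache miss."""
        ids = self._entries.get(query)
        if ids is None:
            return None
        self._entries.move_to_end(query)
        start = (page_number - 1) * page_size
        return ids[start:start + page_size]
```

A miss (`None`) signals that the search has to be run again and its IDs stored with `put`.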
51. Book Content Search
• Search using
– Search query
– Book ID
– List of page IDs
• Highlights
• Annotations
– Currently saved in the DB
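The three inputs listed on this slide map naturally onto one query plus two filter queries. The sketch below builds such a request with highlighting switched on. `q`, `fq`, `hl`, `hl.fl` and `hl.snippets` are standard Solr parameters; the field names `Text`, `BookID` and `PageID` and the snippet count are assumptions.

```python
def book_content_params(query, book_id, page_ids, snippets=3):
    """Parameters for searching inside one book, limited to the given pages."""
    page_filter = "PageID:(" + " OR ".join(str(p) for p in page_ids) + ")"
    return [
        ("q", f"Text:({query})"),
        ("fq", f"BookID:{book_id}"),   # restrict to the open book
        ("fq", page_filter),           # restrict to the requested pages
        ("hl", "true"),                # turn highlighting on
        ("hl.fl", "Text"),             # highlight matches in the page text
        ("hl.snippets", str(snippets)),
        ("wt", "json"),
    ]
```

Passing only the page IDs currently shown in the reader keeps the highlighting work proportional to what the user can see.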
52. Faceting
• Fixed facet fields
– Category, sub-category, language, etc.
– Stored, indexed, exact fields
• Process facets from different indices
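Processing facets from different indices amounts to summing the counts that each index reports for the same facet value. A minimal sketch, assuming each response is already parsed JSON in Solr's default flat layout, where `facet_fields` holds an alternating list of value and count; the `Category` field name is taken from the slide.

```python
from collections import Counter

def parse_flat_facets(flat):
    """Turn ['History', 3, 'Science', 1] into a Counter of value -> count."""
    return Counter(dict(zip(flat[0::2], flat[1::2])))

def merge_facets(responses, field):
    """Sum the counts for one facet field across several index responses."""
    total = Counter()
    for response in responses:
        flat = response["facet_counts"]["facet_fields"].get(field, [])
        total.update(parse_flat_facets(flat))
    return total.most_common()
```

The merged list is sorted by descending count, which matches how facet values are usually displayed.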
53. Personalization
• Using a separate index for personalization
– Different Solr fields for different languages.
– Search across all fields.
• Saving in both Solr and DB
• Indexing tags, ratings and comments using a type field
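One possible shape for a record in the personalization index, following the points on this slide: a type field separates tags, ratings and comments, and the text goes into a language-specific field. All field names here (`UserID`, `BookID`, `Type`, `Value_<lang>`) are hypothetical.

```python
ALLOWED_TYPES = {"tag", "rating", "comment"}

def personalization_doc(user_id, book_id, kind, value, language):
    """Build one personalization record; the field names are illustrative only."""
    if kind not in ALLOWED_TYPES:
        raise ValueError(f"unknown personalization type: {kind}")
    return {
        "UserID": user_id,
        "BookID": book_id,
        "Type": kind,
        # One field per language, e.g. Value_en or Value_ar.
        f"Value_{language.lower()}": str(value),
    }
```

A search across all languages would then query every `Value_*` field, while a filter on `Type` narrows the results to tags, ratings or comments.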
54. Future
• Book mobile application using Solr
• Using Hadoop
• Indexing other digital media (maps, audio, video)