Creating an Open Source Genealogical Search Engine with Apache Solr

Brooke Schreier Ganz
info@leafseek.com
Twitter: @LeafSeek
www.LeafSeek.com

Hi, I'm Brooke
• I make web stuff for fun, and (sometimes) for
profit
• Web Developer at IBM.com and Disney
Consumer Products
• Lead Programmer at TMZ.com (yikes, sorry about that)
• Senior Web Producer at Bravo cable TV
network and its spin-off websites
• Big dork
• Big genealogy dork
• #BigData dork

Meet Gesher Galicia
• Non-profit 501(c)(3) genealogy society
• Founded in 1993
• Hundreds of members, worldwide
• E-mail discussion group
• New website development in progress
(existing website is fugly)
• Needs a search engine…for data

The New Problem
• Diverse Data Languages
(German, Polish, Ukrainian, Russian, Yiddish,
Hebrew, English…)
• Diverse Data Types
(births, marriages, deaths, divorces, tax
lists, landsmanschaften lists, industrial
permit lists, school yearbooks,
governmental yearbooks…)

Existing solutions
• They're okay...for small numbers of
databases, with small amounts of data

– Steve Morse's One-Step Tool Creator
– Roll-your-own solution with PHP and MySQL

• Both get more difficult to manage as data
sets increase in number and complexity

In space, no one can hear your data scream

To Sum Up
• There are lots of ways to publish your tree
• …but not so many ways to publish your
data
• Surely there must be a way to deal with
this?

So I Made A Thing
But “That Thing I Made With The Database And Stuff”
was kind of an awkward name, so I called it

LeafSeek

This is the part where I show you all
the shiny new All Galicia Database

http://search.geshergalicia.org/

Meet Apache Solr
• Highly functional open source search
platform
• Based on Apache Lucene (Java)…
• …plus a web wrapper/API
• Not the prettiest or simplest tool
• FREE and open source

solrconfig.xml

Make sure this part is configured, so you can
import data:

How to get your data into Solr
• Step 1: Make a properly-formatted
spreadsheet
• Step 2: Save spreadsheet as a .CSV file
• Step 3: Create a MySQL database + table
• Step 4: Import CSV into that new table
• Step 5: Add a Unique Auto-Incrementing
Primary Key called “id” (INT)
• Step 6: Add this table's information to
db-data-config.xml
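
Steps 2 through 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server, the sample rows are invented, and load_csv is a hypothetical helper name, not part of LeafSeek.

```python
import csv
import io
import sqlite3

# Invented sample rows standing in for a spreadsheet saved as .CSV.
# The column names are illustrative; the values are made up.
SAMPLE_CSV = """Surname,GivenName,Year,FathersTown
Kowalski,Jan,1872,Tarnow
Adler,Chaja,1885,Brody
Melnyk,Ivan,1890,Sambor
"""


def load_csv(conn, table, csv_text):
    """Create a table from the CSV header, add an auto-incrementing
    integer primary key called "id", and import every data row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)

    # Every spreadsheet column becomes a TEXT column (assumption: no typing).
    column_defs = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(
        f'CREATE TABLE "{table}" '
        f"(id INTEGER PRIMARY KEY AUTOINCREMENT, {column_defs})"
    )

    column_list = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    rows = [row for row in reader if row]  # skip blank lines
    conn.executemany(
        f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})', rows
    )
    conn.commit()
    return len(rows)


# sqlite3 in memory is a stand-in for the MySQL database in the slides.
connection = sqlite3.connect(":memory:")
imported = load_csv(connection, "births", SAMPLE_CSV)
print(f"Imported {imported} rows")
```

With a real MySQL server the same idea applies, but the column types, the import command, and the AUTO_INCREMENT syntax would follow MySQL's rules rather than SQLite's.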

db-data-config.xml
• Basic XML file that tells Solr how to grab
data from your MySQL database(s)
• Add new <dataSource> for new databases
• Add new <entity> for new tables within the
databases
• You need to make sure your MySQL
connector .jar is installed for this to work
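
To make the shape of this file concrete, here is a rough Python sketch that generates a minimal data-import config for one database. The database name, credentials, and table names are placeholders, build_data_config is a hypothetical helper, and the driver class and JDBC URL are assumptions about a typical local MySQL setup; check them against your own installation.

```python
import xml.etree.ElementTree as ET


def build_data_config(db_name, user, password, tables):
    """Return the text of a minimal data-import config:
    one <dataSource> for the database and one <entity> per table."""
    root = ET.Element("dataConfig")
    ET.SubElement(
        root,
        "dataSource",
        {
            "type": "JdbcDataSource",
            # Assumed driver class and URL for a MySQL server on localhost.
            "driver": "com.mysql.jdbc.Driver",
            "url": f"jdbc:mysql://localhost/{db_name}",
            "user": user,
            "password": password,
        },
    )
    document = ET.SubElement(root, "document")
    for table in tables:
        # Each table becomes an entity whose query selects all of its rows.
        ET.SubElement(
            document,
            "entity",
            {"name": table, "query": f"SELECT * FROM {table}"},
        )
    return ET.tostring(root, encoding="unicode")


# Placeholder names, invented for this example.
config_text = build_data_config(
    "genealogy", "solr_user", "change-me", ["births", "tax_lists"]
)
print(config_text)
```

If you import from more than one database, each <dataSource> also needs its own name, and each <entity> has to say which data source it reads from; the sketch above leaves that out.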

schema.xml
• FieldTypes, Fields, and CopyFields
• FieldTypes give indexing and querying
instructions to “buckets”
• Fields say what's what and whether to
make something facetable or not
• CopyFields copy Fields into extra Fields,
so the same data can also be indexed
under a different FieldType

schema.xml - FieldTypes
• 5 Custom FieldTypes (so far):
– givenname
– surname
– surname_bmpm (phonetic)
– place (note: not merely town)
– year (which we're treating as text right now)

schema.xml - Fields
• Fields that start with an uppercase letter
take their names from the MySQL columns
• Examples:
– Year
– SchoolYear
– Surname
– FathersTown
– MothersFathersGivenName
– MaternalGrandfathersGivenName

schema.xml - Fields
• Lowercase fields are added as the data is
imported into Solr, and start with the
prefix record_
• Examples:
– record_type (birth, death, tax, whatever)
– record_source (name of repository)
– record_latlong (latitude,longitude)
– record_id (required!)

schema.xml - Fields
• You do not have to explicitly define every
Field.
• If something is imported that is not named
and defined in schema.xml it will just be
indexed as a straight-up text string, with
nothing done to it.
• Which is fine.
• But IMHO it's better to define everything
anyway so you can remember what's what
and what you are doing to it.

Add-ons and nice-to-haves
(for the back-end)
• Wildcards, and lots of 'em
• Non-name words handled through
stopwords.txt
• Nicknames and name synonyms handled
through synonyms.txt
• Two files included:
– synonyms_-_american-anglo-saxon.txt
– synonyms_-_polish-ukrainian-jewish.txt
• Should be based on your data and your
historical/ethnic community standards

More add-ons and nice-to-haves
(for the back-end)
• Translate your site into different languages:
multilingual content deserves a real
multilingual website
– Pass user preferences through GET value or through
accept-language header or read from a cookie or
whatever you want
• Built-in performance monitoring hooks for New
Relic
• Soundalike searches for surname variants
– Levenshtein distance
– “Regular” Soundex, Metaphone, Caverphone, etc.
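
For readers who have not met Levenshtein distance before, here is a small self-contained Python sketch of it: the number of single-letter insertions, deletions, or substitutions needed to turn one spelling into another. The surnames and the cut-off of 1 are invented for the example, and similar_surnames is a hypothetical helper; inside Solr you would rely on its own fuzzy matching rather than code like this.

```python
def levenshtein(a, b):
    """Edit distance between two strings, using two rows of the
    classic dynamic-programming table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute
                )
            )
        previous = current
    return previous[-1]


def similar_surnames(query, candidates, max_distance):
    """Return the candidates within max_distance edits of the query,
    compared case-insensitively."""
    query = query.lower()
    return [
        name for name in candidates
        if levenshtein(query, name.lower()) <= max_distance
    ]


# Invented surnames; a cut-off of 1 allows a single letter of difference.
print(similar_surnames("Ganz", ["Gans", "Gantz", "Kowalski"], 1))
```

Edit distance is purely about spelling, so it catches typos and transcription slips but knows nothing about how a name sounds, which is where the phonetic algorithms come in.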

This is the part where I tell
the story about

THE SAGA
of Beider-Morse Phonetic Matching
(BMPM)

Relevancy
• Right now, we're using exact matches
• (Of course, “exact” includes
wildcards, alternate names /
synonyms, etc.)
• Like “Old Search” on Ancestry.com
• DisMax! Boosting fields! Scoring!
• (…but not yet)
• Problems with records with multiple
people's names in the record
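
Since DisMax is not in use yet, the following is only a sketch of what a boosted query might look like as Solr request parameters, built in Python. The field names come from the examples earlier in these slides, but the boost values are invented and dismax_params is a hypothetical helper.

```python
from urllib.parse import urlencode


def dismax_params(user_query, boosts):
    """Build request parameters for a DisMax query in which each
    field in `boosts` is searched with its own weight."""
    # qf takes space-separated "field^boost" pairs.
    query_fields = " ".join(f"{field}^{boost}" for field, boost in boosts.items())
    return {
        "q": user_query,
        "defType": "dismax",
        "qf": query_fields,
        "fl": "*,score",  # return every stored field plus the relevancy score
        "wt": "json",
    }


# Invented weights: a hit in Surname counts three times as much
# as a hit in FathersTown.
params = dismax_params("Ganz", {"Surname": 3.0, "FathersTown": 1.0})
print(urlencode(params))
```

Boosting does not solve the multiple-names problem on its own: a record that names a child, two parents, and a grandparent can still match a surname through any of them.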

Lots of Front-End Options
• Ruby:
Sunspot, RSolr, Tanning Bed, acts-as-solr
• Django/Python:
Haystack, Sunburnt, solrpy, pysolr
• Older PHP options:
PECL, solr-php-client
• Plugins for blog/CMS systems:
Drupal, WordPress

Meet Solarium
• http://www.solarium-project.org/
• New, open source PHP wrapper for Solr
• Very active development
• Version 2.4 coming soon

Meet Solarium: The Guts
• You choose the parts of your data to facet
• Data is submitted to the front-end by
POST, not by GET, so the URL never
changes
• You can (and should) paginate results
listings
• You can't actually see the Solr server's
URL from the front-end, not even in view-
source
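
Behind the wrapper, faceting and pagination come down to a handful of Solr request parameters. The Python sketch below builds them by hand to show the idea; it is not Solarium's API (Solarium is PHP), search_params is a hypothetical helper, and the page size of 20 is an invented default.

```python
from urllib.parse import urlencode


def search_params(query, page, per_page, facet_fields):
    """Return Solr request parameters for one page of results,
    with facet counts for each requested field."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be 1 or greater")
    params = [
        ("q", query),
        ("start", (page - 1) * per_page),  # offset of the first result
        ("rows", per_page),  # number of results on this page
        ("facet", "true"),
    ]
    # facet.field is repeated once for every field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    params.append(("wt", "json"))
    return params


# Third page of 20 results, faceted on two fields from the slides.
body = urlencode(search_params("Ganz", 3, 20, ["record_type", "Year"]))
print(body)
```

In a setup like the one described here, the front-end would send this from the web server to Solr, so the visitor's browser never talks to the Solr server directly.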

Add-ons and nice-to-haves
(for the front-end)
• A welcome screen with information about
the database's contents
• Instructions (maybe twice)
• How many records in the database?
• How many datasets?
• What features are coming next?
• What datasets are coming next?

More add-ons and nice-to-haves
(for the front-end)
• Make good UI choices
• Pop-Up Google Maps
• Tooltips to reduce UI clutter
• Cross-browser compatibility
• Still stuck with IE 7 and 8
• CSS and code that degrades gracefully
• No small text

Bird's Eye View of Your Data
• What (surnames, towns, etc.) do I have in
my data?
• What are the TOP (surnames, towns, etc.)
in my data?
• Finding incorrect data
– Outlying years and dates
– Figure out that hard-to-read surname
• Make charts and graphs from your data
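
The same questions can be asked of a plain list of records, which may help show what the facet counts are telling you. The Python sketch below is an illustration only: the records are invented, top_values and outlying_years are hypothetical helpers, and the plausible year range of 1780 to 1945 is an assumption made for the example.

```python
from collections import Counter

# Invented records; Year is kept as text, as in the schema described earlier.
SAMPLE_RECORDS = [
    {"Surname": "Adler", "Year": "1885"},
    {"Surname": "Adler", "Year": "1887"},
    {"Surname": "Adler", "Year": "1987"},  # probably a typo for 1887
    {"Surname": "Kowalski", "Year": "1872"},
    {"Surname": "Kowalski", "Year": "18?2"},  # unreadable digit
    {"Surname": "Melnyk", "Year": "1890"},
]


def top_values(records, field, n):
    """Return the n most common non-empty values of a field with their counts."""
    counts = Counter(record[field] for record in records if record.get(field))
    return counts.most_common(n)


def outlying_years(records, earliest, latest):
    """Return records whose Year is not a number or falls outside the range."""
    suspicious = []
    for record in records:
        try:
            year = int(str(record.get("Year", "")))
        except ValueError:
            suspicious.append(record)  # not a clean number at all
            continue
        if not earliest <= year <= latest:
            suspicious.append(record)
    return suspicious


print(top_values(SAMPLE_RECORDS, "Surname", 2))
print(outlying_years(SAMPLE_RECORDS, 1780, 1945))
```

The counts from top_values are also exactly the kind of numbers you would feed into a chart.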

The (Back-End) Future! (Maybe.)

• Date ranges, instead of just years
• Auto-complete as you type
• “Did you mean...?”
(based on data frequency)
• “More Like This”
(would have to do scoring)
• Record bookmarking system (hashes?)

The (Front-End) Future! (Maybe.)

• Hierarchical facets for locations
• Disambiguating locations
• Social sharing of individual records
• New genealogy data schema
http://historical-data.org/
• Membership login system

Please Do Not Build That Wall
• Password protect some of the databases
• Password protect some of the data
• Open data, but pay for record or surname
bookmarking system
• Open data, but pay for API access
• Open data, but sell online ads
• Open data, but give people guilt trips

Presenting LeafSeek!
• Free and Open Source
• Code is all on GitHub
• Please add, edit, fix, change, tinker
• …and use it!

Why is this FREE?

And why is this important?
