Set Your Records Free!
LeafSeek is a new tool that helps you turn your genealogical or historical record collections into searchable online databases. Combine multiple datasets of different types — such as birth, marriage, and military records — into one unified searchable website. Find inter-connections in your data that you never noticed before.
With great features like built-in geo-spatial searches, pop-up Google Maps, Beider-Morse Phonetic Matching, name synonyms, and language localization, LeafSeek can help you turn your spreadsheets of names and dates into a full-featured genealogy search engine. It’s designed for researchers and genealogy societies alike.
Oh, and one more thing: LeafSeek is free and open source. No strings attached.
Creating an Open Source Genealogical Search Engine with Apache Solr
1. Creating an Open Source
Genealogical Search Engine
With Apache Solr
Brooke Schreier Ganz
info@leafseek.com
Twitter: @LeafSeek
www.LeafSeek.com
2. Hi, I'm Brooke
• I make web stuff for fun, and (sometimes) for
profit
• Web Developer at IBM.com and Disney
Consumer Products
• Lead Programmer at TMZ.com (yikes, sorry about that)
• Senior Web Producer at Bravo cable TV
network and its spin-off websites
• Big dork
• Big genealogy dork
• #BigData dork
3. Meet Gesher Galicia
• Non-profit 501(c)(3) genealogy society
• Founded in 1993
• Hundreds of members, worldwide
• E-mail discussion group
• New website development in progress
(existing website is fugly)
• Needs a search engine…for data
10. The New Problem
• Diverse Data Languages
(German, Polish, Ukrainian, Russian, Yiddish,
Hebrew, English…)
• Diverse Data Types
(births, marriages, deaths, divorces, tax
lists, landsmanschaften lists, industrial
permit lists, school
yearbooks, governmental yearbooks…)
14. Existing solutions
• They're okay...for small numbers of
databases, with small amounts of data
– Steve Morse's One-Step Tool Creator
– Roll-your-own solution with PHP and MySQL
• Both get more difficult to manage as data
sets increase in number and complexity
16. To Sum Up
• There are lots of ways to publish your tree
• …but not so many ways to publish your
data
• Surely there must be a way to deal with
this?
19. So I Made A Thing
But “That Thing I Made With The Database And Stuff”
was kind of an awkward name, so I called it
LeafSeek
20. This is the part where I show you all
the shiny new All Galicia Database
http://search.geshergalicia.org/
21. Meet Apache Solr
• Highly functional open source search
platform
• Based on Apache Lucene (Java)…
• …plus a web wrapper/API
• Not the prettiest or simplest tool
• FREE and open source
34. How to get your data into Solr
• Step 1: Make a properly-formatted
spreadsheet
• Step 2: Save spreadsheet as a .CSV file
• Step 3: Create a MySQL database + table
• Step 4: Import CSV into that new table
• Step 5: Add a Unique Auto-Incrementing
Primary Key called “id” (INT)
• Step 6: Add this table's information to
db-data-config.xml
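Steps 2 through 5 above can be sketched in a few lines of Python. This is a minimal sketch, not LeafSeek's own import code: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server, and the table name births, the column names, and the sample rows are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export of a spreadsheet (Step 2); names and columns are made up.
SAMPLE_CSV = """Surname,GivenName,Year,Town
Kowalski,Jan,1890,Lemberg
Schwartz,Chana,1875,Tarnopol
"""

def import_csv(conn, table, csv_text):
    """Create `table` with an auto-incrementing `id` key and load the CSV rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Steps 3 and 5: one TEXT column per spreadsheet column, plus the id key.
    # (In MySQL the key would be declared as: id INT AUTO_INCREMENT PRIMARY KEY.)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(
        f'CREATE TABLE "{table}" (id INTEGER PRIMARY KEY AUTOINCREMENT, {columns})'
    )
    # Step 4: insert every CSV row; the database assigns each id.
    names = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        f'INSERT INTO "{table}" ({names}) VALUES ({placeholders})', reader
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
import_csv(conn, "births", SAMPLE_CSV)
rows = conn.execute('SELECT id, Surname, Year FROM "births" ORDER BY id').fetchall()
print(rows)
```

Every column is loaded as text here, which keeps the import simple; the id column is the only one the database generates itself.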
37. db-data-config.xml
• Basic XML file that tells Solr how to grab
data from your MySQL database(s)
• Add new <dataSource> for new databases
• Add new <entity> for new tables within the
databases
• You need to make sure your MySQL
connector .jar is installed for this to work
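To make the shape of db-data-config.xml concrete, here is a hedged Python sketch that generates one. The element and attribute names (dataConfig, dataSource, document, entity, JdbcDataSource) follow Solr's DataImportHandler conventions; the database name, credentials, table names, and the SELECT * query are hypothetical, and the driver class shown is the older MySQL Connector/J name, which may differ in your install.

```python
import xml.etree.ElementTree as ET

def build_data_config(sources):
    """Build a DataImportHandler-style config from a list of source dicts."""
    root = ET.Element("dataConfig")
    document = ET.Element("document")
    for source in sources:
        # One <dataSource> per MySQL database.
        ET.SubElement(root, "dataSource", {
            "name": source["name"],
            "type": "JdbcDataSource",
            "driver": "com.mysql.jdbc.Driver",  # assumption: older Connector/J class
            "url": source["url"],
            "user": source["user"],
            "password": source["password"],
        })
        # One <entity> per table within that database.
        for table in source["tables"]:
            ET.SubElement(document, "entity", {
                "name": table,
                "dataSource": source["name"],
                "query": f"SELECT * FROM {table}",
            })
    root.append(document)
    return root

# Hypothetical database with two tables.
config = build_data_config([{
    "name": "galicia",
    "url": "jdbc:mysql://localhost/galicia",
    "user": "solr",
    "password": "secret",
    "tables": ["births", "marriages"],
}])
xml_text = ET.tostring(config, encoding="unicode")
print(xml_text)
```

Adding a database means adding another dict to the list; adding a table means adding another name to its tables list.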
41. schema.xml
• FieldTypes, Fields, and CopyFields
• FieldTypes give indexing and querying
instructions to “buckets”
• Fields say what's what and whether to
make something facetable or not
• CopyFields collect Fields together into
extra FieldTypes
42. schema.xml - FieldTypes
• 5 Custom FieldTypes (so far):
– givenname
– surname
– surname_bmpm (phonetic)
– place (note: not merely town)
– year (which we're treating as text right now)
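For readers who have not seen a schema.xml, this Python sketch generates a tiny fragment showing how a FieldType, Fields, and CopyFields relate. It is an illustration under assumptions, not LeafSeek's actual schema: the analyzer chain (standard tokenizer plus lowercase filter) is a guess at a plausible minimum, and the FathersSurname and all_surnames field names are hypothetical.

```python
import xml.etree.ElementTree as ET

def build_schema_fragment():
    """Build a small schema.xml-style fragment with one custom FieldType."""
    schema = ET.Element("schema", {"name": "example"})
    types = ET.SubElement(schema, "types")
    fields = ET.SubElement(schema, "fields")

    # FieldType: indexing and querying instructions for the "surname" bucket.
    surname = ET.SubElement(types, "fieldType", {
        "name": "surname", "class": "solr.TextField",
    })
    analyzer = ET.SubElement(surname, "analyzer")
    ET.SubElement(analyzer, "tokenizer", {"class": "solr.StandardTokenizerFactory"})
    ET.SubElement(analyzer, "filter", {"class": "solr.LowerCaseFilterFactory"})

    # Fields: say which named values go into that bucket.
    for name in ("Surname", "FathersSurname"):
        ET.SubElement(fields, "field", {
            "name": name, "type": "surname", "indexed": "true", "stored": "true",
        })
    # A catch-all field (hypothetical) that receives copies of the others.
    ET.SubElement(fields, "field", {
        "name": "all_surnames", "type": "surname",
        "indexed": "true", "stored": "false", "multiValued": "true",
    })
    # CopyFields: collect several Fields into the catch-all.
    for source in ("Surname", "FathersSurname"):
        ET.SubElement(schema, "copyField", {"source": source, "dest": "all_surnames"})
    return schema

schema = build_schema_fragment()
print(ET.tostring(schema, encoding="unicode"))
```

Searching all_surnames then matches a surname wherever it appears in the record, which is one way to approach records that name several people.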
46. schema.xml - Fields
• Uppercase fields take their names from the
MySQL column names
• Examples:
– Year
– SchoolYear
– Surname
– FathersTown
– MothersFathersGivenName
– MaternalGrandfathersGivenName
47. schema.xml - Fields
• Lowercase fields are added as the data is
imported into Solr, and start with the
prefix record_
• Examples:
– record_type (birth, death, tax, whatever)
– record_source (name of repository)
– record_latlong (latitude,longitude)
– record_id (required!)
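The split between uppercase column-derived fields and lowercase record_ fields can be sketched as a small Python function. This is only an illustration of the naming convention described above: the way record_id is composed (record type plus the MySQL id) and the sample values are assumptions, not necessarily how LeafSeek does it.

```python
def row_to_solr_doc(row, record_type, record_source, latlong=None):
    """Turn one MySQL row (a dict keyed by column name) into a Solr document."""
    # Uppercase fields: copied straight from the MySQL columns, skipping blanks.
    doc = {
        column: value
        for column, value in row.items()
        if column != "id" and value not in (None, "")
    }
    # Lowercase fields: added at import time, all prefixed with record_.
    doc["record_type"] = record_type
    doc["record_source"] = record_source
    # Assumption: build a unique record_id from the dataset type and the row id.
    doc["record_id"] = f"{record_type}-{row['id']}"
    if latlong is not None:
        latitude, longitude = latlong
        doc["record_latlong"] = f"{latitude},{longitude}"
    return doc

# Hypothetical row and repository name.
row = {"id": 7, "Surname": "Kowalski", "Year": "1890", "FathersTown": ""}
doc = row_to_solr_doc(row, "birth", "Hypothetical Archive", latlong=(49.84, 24.03))
print(doc)
```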
48. schema.xml - Fields
• You do not have to explicitly define every
Field.
• If something is imported that is not named
and defined in schema.xml it will just be
indexed as a straight-up text string, with
nothing done to it.
• Which is fine.
• But IMHO it's better to define everything
anyway so you can remember what's what
and what you are doing to it.
51. Add-ons and nice-to-haves
(for the back-end)
• Wildcards, and lots of 'em
• Non-name words handled through
stopwords.txt
• Nicknames and name synonyms handled
through synonyms.txt
• Two files included:
– synonyms_-_american-anglo-saxon.txt
– synonyms_-_polish-ukrainian-jewish.txt
• Should be based on your data and your
historical/ethnic community standards
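To show what a synonyms file does, here is a minimal Python sketch that reads the comma-separated format Solr uses for equivalent terms and expands a name into its group. It is not Solr's implementation: it skips explicit "=>" mappings, and the sample names are hypothetical rather than taken from the two files listed above.

```python
# Hypothetical sample in Solr's synonyms.txt style: one group per line.
SAMPLE_SYNONYMS = """# nicknames and variants
william, bill, will
chaim, hyman, herman
"""

def load_synonyms(text):
    """Map each lowercased name to the full set of names in its group."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and explicit mappings (not handled here).
        if not line or line.startswith("#") or "=>" in line:
            continue
        group = {name.strip().lower() for name in line.split(",") if name.strip()}
        for name in group:
            table.setdefault(name, set()).update(group)
    return table

def expand(name, table):
    """Return every name to search for when the user types `name`."""
    key = name.lower()
    return sorted(table.get(key, {key}))

synonyms = load_synonyms(SAMPLE_SYNONYMS)
print(expand("Bill", synonyms))
print(expand("Zelda", synonyms))
```

A name with no entry simply expands to itself, so an incomplete synonyms file never hides results; it just finds fewer variants.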
54. More add-ons and nice-to-haves
(for the back-end)
• Translate your site into different languages –
multi-lingual content deserves a real
multi-lingual website
– Pass the user's language preference through a
GET value, the Accept-Language header, a cookie,
or whatever you want
• Built-in performance monitoring hooks for New
Relic
• Soundalike searches for surname variants
– Levenshtein distance
– “Regular” Soundex, Metaphone, Caverphone, etc.
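As a concrete example of a soundalike measure, here is a plain-Python Levenshtein distance: the number of single-letter insertions, deletions, or substitutions needed to turn one spelling into another. This is a textbook sketch, not the code Solr or LeafSeek runs, and the surnames and the max_distance cut-off are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]

def close_surnames(query, surnames, max_distance=2):
    """Return the surnames within max_distance edits of the query."""
    query = query.lower()
    return [s for s in surnames if levenshtein(query, s.lower()) <= max_distance]

print(levenshtein("kitten", "sitting"))
print(close_surnames("Ganz", ["Gans", "Gantz", "Schwartz"]))
```

Edit distance treats every letter change alike, so it knows nothing about how names actually sound in a given language.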
55. This is the part where I tell
the story about
THE SAGA
of Beider-Morse Phonetic Matching
(BMPM)
56. Relevancy
• Right now, we're using exact matches
• (Of course, “exact” includes
wildcards, alternate names /
synonyms, etc.)
• Like “Old Search” on Ancestry.com
• DisMax! Boosting fields! Scoring!
• (…but not yet)
• Problems with records that contain multiple
people's names
64. Meet Solarium: The Guts
• You choose the parts of your data to facet
• Data is submitted to the front-end by
POST, not by GET, so the URL never
changes
• You can (and should) paginate results
listings
• You can't actually see the Solr server's
URL from the front-end, not even in
view-source
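Solarium itself is a PHP library, so the following is not Solarium code. It is a hedged Python sketch of the request a front-end would build server-side for Solr's select handler, using Solr's standard q, start, rows, facet, and facet.field parameters. The field names and page size are hypothetical.

```python
from urllib.parse import urlencode

def build_select_params(query, facet_fields, page=1, per_page=20):
    """Build Solr select parameters for one page of faceted results."""
    params = [
        ("q", query),
        ("wt", "json"),
        # Pagination: Solr counts from zero, so page 3 starts at row 40.
        ("start", (page - 1) * per_page),
        ("rows", per_page),
    ]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field may be repeated, once per field to facet on.
        params.extend(("facet.field", field) for field in facet_fields)
    return params

params = build_select_params("Surname:Kowalski", ["record_type", "Year"], page=3)
query_string = urlencode(params)
print(query_string)
```

Because this string is assembled on the server and sent to Solr from there, the visitor's browser never sees it, which is why the Solr URL does not show up on the front-end.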
65. Add-ons and nice-to-haves
(for the front-end)
• A welcome screen with information about
the database's contents
• Instructions (maybe twice)
• How many records in the database?
• How many datasets?
• What features are coming next?
• What datasets are coming next?
66. Add-ons and nice-to-haves
(for the front-end)
• Make good UI choices
• Pop-Up Google Maps
• Tooltips to reduce UI clutter
• Cross-browser compatibility
• Still stuck with IE 7 and 8
• CSS and code that degrades gracefully
• No small text
67. Bird's Eye View of Your Data
• What (surnames, towns, etc.) do I have in
my data?
• What are the TOP (surnames, towns, etc.)
in my data?
• Finding incorrect data
– Outlying years and dates
– Figure out that hard-to-read surname
• Make charts and graphs from your data
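The same bird's eye questions can be asked of a plain list of records. This Python sketch counts the top values of a field and flags years that are unreadable or fall outside an expected range. In Solr the counts would come from facets; the records, field names, and year range here are hypothetical.

```python
from collections import Counter

def top_values(records, field, n=3):
    """Return the n most common non-empty values of `field` with their counts."""
    counts = Counter(r[field] for r in records if r.get(field))
    return counts.most_common(n)

def outlying_years(records, earliest, latest):
    """Return records whose Year is not a number or lies outside the range."""
    flagged = []
    for record in records:
        year = record.get("Year", "")
        if not year.isdigit() or not earliest <= int(year) <= latest:
            flagged.append(record)
    return flagged

# Hypothetical records; Year is text, as in the schema described earlier.
records = [
    {"Surname": "Kowalski", "Year": "1890"},
    {"Surname": "Kowalski", "Year": "1891"},
    {"Surname": "Kowalski", "Year": "1985"},
    {"Surname": "Schwartz", "Year": "1875"},
    {"Surname": "Schwartz", "Year": "18?7"},
    {"Surname": "Adler", "Year": "1860"},
]
print(top_values(records, "Surname", n=2))
print(outlying_years(records, 1780, 1920))
```

The flagged records are candidates for a second look at the original document, not automatic corrections.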
69. The (Back-End) Future! (Maybe.)
• Date ranges, instead of just years
• Auto-complete as you type
• “Did you mean...?”
(based on data frequency)
• “More Like This”
(would have to do scoring)
• Record bookmarking system (hashes?)
70. The (Front-End) Future! (Maybe.)
• Hierarchical facets for locations
• Disambiguating locations
• Social sharing of individual records
• New genealogy data schema
http://historical-data.org/
• Membership login system
72. Please Do Not Build That Wall
• Password protect some of the databases
• Password protect some of the data
• Open data, but pay for record or surname
bookmarking system
• Open data, but pay for API access
• Open data, but sell online ads
• Open data, but give people guilt trips
73. Presenting LeafSeek!
• Free and Open Source
• Code is all on GitHub
• Please add, edit, fix, change, tinker
• …and use it!