Creating an Open Source
Genealogical Search Engine
    With Apache Solr


              Brooke Schreier Ganz
                 info@leafseek.com
                Twitter: @LeafSeek
                www.LeafSeek.com
Hi, I'm Brooke
• I make web stuff for fun, and (sometimes) for
  profit
• Web Developer at IBM.com and Disney
  Consumer Products
• Lead Programmer at TMZ.com (yikes, sorry about that)
• Senior Web Producer at Bravo cable TV
  network and its spin-off websites
• Big dork
• Big genealogy dork
• #BigData dork
Meet Gesher Galicia
• Non-profit 501(c)(3) genealogy society
• Founded in 1993
• Hundreds of members, worldwide
• E-mail discussion group
• New website development in progress
  (existing website is fugly)
• Needs a search engine…for data
The Old Problem
The New Problem
• Diverse Data Languages
  (German, Polish, Ukrainian, Russian, Yiddish,
  Hebrew, English…)
• Diverse Data Types
  (births, marriages, deaths, divorces, tax
  lists, landsmanschaften lists, industrial
  permit lists, school
  yearbooks, governmental yearbooks…)
Diverse Data Shapes
Existing solutions
• They're okay...for small numbers of
  databases, with small amounts of data

  – Steve Morse's One-Step Tool Creator
  – Roll-your-own solution with PHP and MySQL


• Both get more difficult to manage as data
  sets increase in number and complexity
In space, no one can hear your data scream
To Sum Up
• There are lots of ways to publish your tree
• …but not so many ways to publish your
  data
• Surely there must be a way to deal with
  this?
So I Made A Thing
But “That Thing I Made With The Database And Stuff”
     was kind of an awkward name, so I called it



             LeafSeek
This is the part where I show you all
the shiny new All Galicia Database

  http://search.geshergalicia.org/
Meet Apache Solr
• Highly functional open source search
  platform
• Based on Apache Lucene (Java)…
• …plus a web wrapper/API
• Not the prettiest or simplest tool
• FREE and open source
Saves Time, and Heartache
Saves Time, and Stomachache
File Structure: Back-End
Welcome to /conf
The Important Stuff
solrconfig.xml

Make sure this part is configured, so you can
import data:
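
The slide's screenshot is not reproduced here. As a rough sketch only, the part of solrconfig.xml that usually matters for importing is a request handler for Solr's DataImportHandler that points at db-data-config.xml. The Python below just builds and prints such a fragment; the handler name /dataimport is the conventional one and is an assumption about this particular setup.

```python
import xml.etree.ElementTree as ET

# Sketch of a DataImportHandler registration for solrconfig.xml.
# The handler name "/dataimport" is an assumed (conventional) choice.
handler = ET.Element("requestHandler", {
    "name": "/dataimport",
    "class": "org.apache.solr.handler.dataimport.DataImportHandler",
})
defaults = ET.SubElement(handler, "lst", {"name": "defaults"})
config = ET.SubElement(defaults, "str", {"name": "config"})
config.text = "db-data-config.xml"  # the file described later in the deck

snippet = ET.tostring(handler, encoding="unicode")
print(snippet)
```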
How to get your data into Solr
• Step 1: Make a properly-formatted
  spreadsheet
• Step 2: Save spreadsheet as a .CSV file
• Step 3: Create a MySQL database + table
• Step 4: Import CSV into that new table
• Step 5: Add a Unique Auto-Incrementing
  Primary Key called “id” (INT)
• Step 6: Add this table's information to
  db-data-config.xml
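
Steps 2 through 5 can be sketched in a few lines of Python. To keep the example self-contained it uses the standard library's sqlite3 as a stand-in for MySQL, and the table name, column names, and rows are made up for illustration; a real import would go through MySQL's own tools.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for the exported spreadsheet (Step 2).
SAMPLE_CSV = """Surname,GivenName,Year,Town
Kowalski,Jan,1872,Lwow
Nowak,Anna,1880,Tarnopol
"""

def load_csv(conn, table, csv_text):
    """Create `table` with an auto-incrementing integer `id`, then import the CSV rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    # Steps 3 and 5: the table plus its unique auto-incrementing primary key.
    conn.execute(
        f'CREATE TABLE "{table}" (id INTEGER PRIMARY KEY AUTOINCREMENT, {columns})'
    )
    names = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    # Step 4: import the CSV rows into the new table.
    conn.executemany(
        f'INSERT INTO "{table}" ({names}) VALUES ({placeholders})', data
    )
    conn.commit()
    return len(data)

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
imported = load_csv(conn, "births", SAMPLE_CSV)
records = conn.execute('SELECT id, "Surname", "Year" FROM births ORDER BY id').fetchall()
print(imported, records)
```

Every column is stored as text here, which matches the deck's later note that years are treated as text for now.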
db-data-config.xml
• Basic XML file that tells Solr how to grab
  data from your MySQL database(s)
• Add new <dataSource> for new databases
• Add new <entity> for new tables within the
  databases
• You need to make sure your MySQL
  connector .jar is installed for this to work
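
A minimal sketch of what such a file might contain, generated with Python so the structure is easy to see: one <dataSource> for the database and one <entity> per table. The database name, table names, and credentials are placeholders, and the driver class shown is the older MySQL Connector/J name, so treat all of them as assumptions.

```python
import xml.etree.ElementTree as ET

def build_data_config(db_name, tables, user="solr_user", password="CHANGE_ME"):
    """Build a DataImportHandler config: one dataSource, one entity per table."""
    root = ET.Element("dataConfig")
    ET.SubElement(root, "dataSource", {
        "type": "JdbcDataSource",
        "driver": "com.mysql.jdbc.Driver",  # needs the MySQL connector .jar
        "url": f"jdbc:mysql://localhost/{db_name}",
        "user": user,          # placeholder credentials
        "password": password,
    })
    document = ET.SubElement(root, "document")
    for table in tables:
        # Each table becomes its own entity with a simple SELECT.
        ET.SubElement(document, "entity", {
            "name": table,
            "query": f"SELECT * FROM {table}",
        })
    return root

# Hypothetical database and table names.
data_config = build_data_config("genealogy", ["births", "tax_lists"])
xml_text = ET.tostring(data_config, encoding="unicode")
print(xml_text)
```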
Import!
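
Assuming the DataImportHandler is registered at /dataimport (an assumption, as is the localhost address below), a full import is typically kicked off by requesting a URL like the one this sketch builds. It only constructs the URL; it does not contact a server.

```python
from urllib.parse import urlencode

def dataimport_url(base_url, command="full-import", clean=True, commit=True):
    """Build the URL that asks the DataImportHandler to run a command."""
    params = {
        "command": command,
        "clean": str(clean).lower(),    # wipe the existing index first
        "commit": str(commit).lower(),  # make the new documents searchable
    }
    return f"{base_url.rstrip('/')}/dataimport?{urlencode(params)}"

# Hypothetical Solr address.
url = dataimport_url("http://localhost:8983/solr")
print(url)
```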
schema.xml
• FieldTypes, Fields, and CopyFields
• FieldTypes give indexing and querying
  instructions to “buckets”
• Fields say what's what and whether to
  make something facetable or not
• CopyFields copy several Fields together into
  an extra Field, which has its own FieldType
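
To make the three pieces concrete, here is a small sketch that assembles a toy schema.xml in Python. It is not the LeafSeek schema: the all_surnames field and the MothersSurname column are hypothetical, and the analyzer is a plain lowercase one rather than anything phonetic.

```python
import xml.etree.ElementTree as ET

schema = ET.Element("schema", {"name": "example", "version": "1.5"})
types = ET.SubElement(schema, "types")
fields = ET.SubElement(schema, "fields")

# FieldTypes: the "buckets" that carry indexing and querying instructions.
ET.SubElement(types, "fieldType", {"name": "string", "class": "solr.StrField"})
surname_type = ET.SubElement(types, "fieldType",
                             {"name": "surname", "class": "solr.TextField"})
analyzer = ET.SubElement(surname_type, "analyzer")
ET.SubElement(analyzer, "tokenizer", {"class": "solr.StandardTokenizerFactory"})
ET.SubElement(analyzer, "filter", {"class": "solr.LowerCaseFilterFactory"})

# Fields: one per source column, stored as-is so they can be faceted.
source_fields = ["Surname", "MothersSurname"]  # second name is hypothetical
for name in source_fields:
    ET.SubElement(fields, "field", {
        "name": name, "type": "string", "indexed": "true", "stored": "true",
    })

# Destination field that gathers every surname for searching.
ET.SubElement(fields, "field", {
    "name": "all_surnames", "type": "surname",
    "indexed": "true", "stored": "false", "multiValued": "true",
})

# CopyFields: copy each source field into the destination field.
for name in source_fields:
    ET.SubElement(schema, "copyField", {"source": name, "dest": "all_surnames"})

schema_xml = ET.tostring(schema, encoding="unicode")
print(schema_xml)
```

Searching all_surnames then matches a surname wherever it appears in the record, while the original fields stay untouched for display and faceting.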
schema.xml - FieldTypes
• 5 Custom FieldTypes (so far):
  – givenname
  – surname
  – surname_bmpm (phonetic)
  – place (note: not merely town)
  – year (which we're treating as text right now)
schema.xml - Fields
• Uppercase field names come from the
  MySQL column names
• Examples:
  – Year
  – SchoolYear
  – Surname
  – FathersTown
  – MothersFathersGivenName
  – MaternalGrandfathersGivenName
schema.xml - Fields
• Lowercase fields are added as the data
  is imported into Solr, and start
  with the prefix record_
• Examples:
  – record_type (birth, death, tax, whatever)
  – record_source (name of repository)
  – record_latlong (latitude,longitude)
  – record_id (required!)
schema.xml - Fields
• You do not have to explicitly define every
  Field.
• If something is imported that is not named
  and defined in schema.xml it will just be
  indexed as a straight-up text string, with
  nothing done to it.
• Which is fine.
• But IMHO it's better to define everything
  anyway so you can remember what's what
  and what you are doing to it.
schema.xml - CopyFields
Add-ons and nice-to-haves
         (for the back-end)
• Wildcards, and lots of 'em
• Non-name words handled through
  stopwords.txt
• Nicknames and name synonyms handled
  through synonyms.txt
• Two files included:
  – synonyms_-_american-anglo-saxon.txt
  – synonyms_-_polish-ukrainian-jewish.txt
• Should be based on your data and your
  historical/ethnic community standards
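
For a feel of how a synonyms file behaves, here is a rough Python sketch that reads the two common line formats Solr accepts (comma-separated equivalents, and explicit "=>" mappings) and expands a searched name. The sample names are illustrative only and are not taken from the two files listed above.

```python
# Hypothetical sample in the Solr synonyms.txt format.
SAMPLE_SYNONYMS = """
# equivalent given names
william, bill, will
moshe, moses, moishe
# explicit mapping: left side is rewritten to the right side
beila => bella, beila
"""

def parse_synonyms(text):
    """Return a dict mapping each lowercased name to the set of names it expands to."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        if "=>" in line:
            left, right = line.split("=>", 1)
            sources = [t.strip().lower() for t in left.split(",") if t.strip()]
            targets = {t.strip().lower() for t in right.split(",") if t.strip()}
        else:
            sources = [t.strip().lower() for t in line.split(",") if t.strip()]
            targets = set(sources)  # every name is equivalent to every other
        for source in sources:
            mapping.setdefault(source, set()).update(targets)
    return mapping

def expand(name, mapping):
    """Names to search for when the user types `name`."""
    key = name.lower()
    return mapping.get(key, {key})

synonyms = parse_synonyms(SAMPLE_SYNONYMS)
print(sorted(expand("Bill", synonyms)))
print(sorted(expand("Beila", synonyms)))
```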
More add-ons and nice-to-haves
        (for the back-end)
• Translate your site into different languages – multi-
  lingual content deserves a real multi-lingual
  website
   – Pass user preferences through GET value or through
     accept-language header or read from a cookie or
     whatever you want
• Built-in performance monitoring hooks for New
  Relic
• Soundalike searches for surname variants
   – Levenshtein distance
   – “Regular” Soundex, Metaphone, Caverphone, etc.
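
Two of the soundalike techniques named above are small enough to sketch in plain Python: Levenshtein (edit) distance and classic American Soundex. This is an illustration of the ideas only, not the code Solr or LeafSeek runs, and it is not Beider-Morse matching.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions from a to b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]

_SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}

def soundex(name):
    """Classic four-character American Soundex code."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    last_code = _SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters that share a code
        code = _SOUNDEX_CODES.get(c, "")
        if code and code != last_code:
            result += code
        last_code = code  # a vowel resets this, so repeated codes count again
    return (result + "000")[:4]

print(levenshtein("schreier", "schreyer"))
print(soundex("Robert"), soundex("Rupert"))
```

Edit distance catches one-letter spelling variants; Soundex groups names that sound alike in English, which is part of why it tends to fit Eastern European surnames poorly.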
This is the part where I tell
          the story about


     THE SAGA
of Beider-Morse Phonetic Matching
             (BMPM)
Relevancy
• Right now, we're using exact matches
• (Of course, “exact” includes
  wildcards, alternate names /
  synonyms, etc.)
• Like “Old Search” on Ancestry.com
• DisMax! Boosting fields! Scoring!
• (…but not yet)
• Problems with records that contain
  multiple people's names
Lots of Front-End Options
• Ruby:
  Sunspot, RSolr, Tanning Bed, acts-as-solr
• Django/Python:
  Haystack, Sunburnt, solrpy, pysolr
• Older PHP options:
  PECL, solr-php-client
• Plugins for blog/CMS systems:
  Drupal, WordPress
Meet Solarium
•   http://www.solarium-project.org/
•   New, open source PHP wrapper for Solr
•   Very active development
•   Version 2.4 coming soon
File Structure: Front-End
Meet Solarium: The Config
Meet Solarium: The Guts
• You choose the parts of your data to facet
• Data is submitted to the front-end by
  POST, not by GET, so the URL never
  changes
• You can (and should) paginate results
  listings
• You can't actually see the Solr server's
  URL from the front-end, not even in view-
  source
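
Behind the scenes, the front-end turns the submitted form into a Solr query on the server side. The sketch below shows, in Python rather than PHP and without using Solarium's API, roughly what parameters a faceted, paginated search sends to Solr. The field names come from earlier slides; the query and page size are made up.

```python
from urllib.parse import urlencode

def build_select_params(query, facet_fields, page=1, per_page=20):
    """Parameters for a faceted, paginated request to Solr's select handler."""
    params = [
        ("q", query),
        ("wt", "json"),
        ("start", str((page - 1) * per_page)),  # offset of the first result
        ("rows", str(per_page)),                # results per page
        ("facet", "true"),
        ("facet.mincount", "1"),                # hide empty facet values
    ]
    # One facet.field parameter per field chosen for faceting.
    params += [("facet.field", field) for field in facet_fields]
    return params

# Hypothetical search: surnames starting with "Gan", third page of 25.
params = build_select_params("Surname:Gan*", ["record_type", "Year"], page=3, per_page=25)
query_string = urlencode(params)
print(query_string)
```

Because this request is made by the server, the visitor's browser only ever talks to the front-end, which is why the Solr URL never shows up in view-source.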
Add-ons and nice-to-haves
        (for the front-end)
• A welcome screen with information about
  the database's contents
• Instructions (maybe twice)
• How many records in the database?
• How many datasets?
• What features are coming next?
• What datasets are coming next?
Add-ons and nice-to-haves
           (for the front-end)
•   Make good UI choices
•   Pop-Up Google Maps
•   Tooltips to reduce UI clutter
•   Cross-browser compatibility
•   Still stuck with IE 7 and 8
•   CSS and code that degrades gracefully
•   No small text
Bird's Eye View of Your Data
• What (surnames, towns, etc.) do I have in
  my data?
• What are the TOP (surnames, towns, etc.)
  in my data?
• Finding incorrect data
  – Outlying years and dates
  – Figure out that hard-to-read surname
• Make charts and graphs from your data
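
In Solr these answers come from facet counts. The same idea can be sketched on a handful of made-up records in plain Python: count the values of a field to get the top surnames, and flag years that fall outside a plausible range. The records and the year range are assumptions for illustration.

```python
from collections import Counter

# Hypothetical records; the year 2872 is a deliberate data-entry error.
records = [
    {"Surname": "Kowalski", "Town": "Lwow", "Year": 1872},
    {"Surname": "Nowak", "Town": "Tarnopol", "Year": 1880},
    {"Surname": "Kowalski", "Town": "Lwow", "Year": 1791},
    {"Surname": "Adler", "Town": "Brody", "Year": 2872},
    {"Surname": "Nowak", "Town": "Lwow", "Year": 1901},
    {"Surname": "Kowalski", "Town": "Brody", "Year": 1865},
]

def top_values(rows, field, n=3):
    """Most common values of `field`, like a facet sorted by count."""
    return Counter(row[field] for row in rows if row.get(field)).most_common(n)

def outlying_years(rows, earliest=1700, latest=1950):
    """Rows whose Year falls outside the expected range (assumed bounds)."""
    return [row for row in rows if not earliest <= row["Year"] <= latest]

top_surnames = top_values(records, "Surname", n=2)
suspects = outlying_years(records)
print(top_surnames)
print(suspects)
```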
The (Back-End) Future!        (Maybe.)

• Date ranges, instead of just years
• Auto-complete as you type
• “Did you mean...?”
  (based on data frequency)
• “More Like This”
  (would have to do scoring)
• Record bookmarking system (hashes?)
The (Front-End) Future!         (Maybe.)

• Hierarchical facets for locations
• Disambiguating locations
• Social sharing of individual records
• New genealogy data schema
  http://historical-data.org/
• Membership login system
Please Do Not Build That Wall
• Password protect some of the databases
• Password protect some of the data
• Open data, but pay for record or surname
  bookmarking system
• Open data, but pay for API access
• Open data, but sell online ads
• Open data, but give people guilt trips
Presenting LeafSeek!
•   Free and Open Source
•   Code is all on GitHub
•   Please add, edit, fix, change, tinker
•   …and use it!
Why is this FREE?

And why is this important?
Thank you! :-)

Creating an Open Source Genealogical Search Engine with Apache Solr

  • 1. Creating an Open Source Genealogical Search Engine With Apache Solr Brooke Schreier Ganz info@leafseek.com Twitter: @LeafSeek www.LeafSeek.com
  • 2. Hi, I'm Brooke • I make web stuff for fun, and (sometimes) for profit • Web Developer at IBM.com and Disney Consumer Products • Lead Programmer at TMZ.com (yikes, sorry about that) • Senior Web Producer at Bravo cable TV network and its spin-off websites • Big dork • Big genealogy dork • #BigData dork
  • 3. Meet Gesher Galicia • Non-profit 501(c)3 genealogy society • Founded in 1993 • Hundreds of members, worldwide • E-mail discussion group • New website development in progress (existing website is fugly) • Needs a search engine…for data
  • 10. The New Problem • Diverse Data Languages (German, Polish, Ukrainian, Russian, Yiddish, Hebrew, English…) • Diverse Data Types (births, marriages, deaths, divorces, tax lists, landsmanschaften lists, industrial permit lists, school yearbooks, governmental yearbooks…)
  • 14. Existing solutions • They're okay...for small numbers of databases, with small amounts of data – Steve Morse's One-Step Tool Creator – Roll-your-own solution with PHP and MySQL • Both get more difficult to manage as data sets increase in number and complexity
  • 15. In space, no one can hear your data scream
  • 16. To Sum Up • There are lots of ways to publish your tree • …but not so many ways to publish your data • Surely there must be a way to deal with this?
  • 19. So I Made A Thing But “That Thing I Made With The Database And Stuff” was kind of an awkward name, so I called it LeafSeek
  • 20. This is the part where I show you all the shiny new All Galicia Database http://search.geshergalicia.org/
  • 21. Meet Apache Solr • Highly functional open source search platform • Based on Apache Lucene (Java)… • …plus a web wrapper/API • Not the prettiest or simplest tool • FREE and open source
  • 24. Saves Time, and Heartache
  • 26. Saves Time, and Stomachache
  • 33. solrconfig.xml Make sure this part is configured, so you can import data:
  • 34. How to get your data into Solr • Step 1: Make a properly-formatted spreadsheet • Step 2: Save spreadsheet as a .CSV file • Step 3: Create a MySQL database + table • Step 4: Import CSV into that new table • Step 5: Add a Unique Auto-Incrementing Primary Key called “id” (INT) • Step 6: Add this table's information to db-data-config.xml
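Steps 3 and 5 can be sketched in a few lines of Python. This is only an illustration, not LeafSeek's own tooling: the table name `births_example`, the sample CSV row, and the choice to make every column `VARCHAR(255)` are all assumptions. The sketch reads a CSV header and builds a MySQL `CREATE TABLE` statement that already includes the auto-incrementing `id` key.

```python
import csv
import io


def create_table_sql(csv_text, table_name):
    """Build a MySQL CREATE TABLE statement from a CSV header row.

    Every CSV column becomes a VARCHAR(255) column (an assumption made for
    simplicity), and an auto-incrementing INT primary key named "id" is added.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    columns = ["  `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY"]
    columns += ["  `{}` VARCHAR(255)".format(name.strip()) for name in header]
    return "CREATE TABLE `{}` (\n{}\n);".format(table_name, ",\n".join(columns))


# Hypothetical placeholder data, not taken from any real record set.
sample_csv = "Year,Surname,GivenName,Town\n1885,Example,Chaim,Lemberg\n"
statement = create_table_sql(sample_csv, "births_example")
print(statement)
```

Creating the `id` column up front folds Step 5 into Step 3. If you do it this way, the CSV import in Step 4 has to name its target columns, because the CSV itself has no `id` column.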
  • 37. db-data-config.xml • Basic XML file that tells Solr how to grab data from your MySQL database(s) • Add new <dataSource> for new databases • Add new <entity> for new tables within the databases • You need to make sure your MySQL connector .jar is installed for this to work
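To make the shape of that file concrete, here is a rough Python sketch that generates a minimal db-data-config.xml with one `<dataSource>` and one `<entity>` per table. The database URL, user name, password, and table names are hypothetical placeholders. The `JdbcDataSource` type and the `com.mysql.jdbc.Driver` class name are the ones commonly used with Solr's DataImportHandler and the MySQL connector, but check them against your own Solr and connector versions.

```python
import xml.etree.ElementTree as ET


def build_data_config(db_url, user, password, tables):
    """Return db-data-config.xml text with one <entity> per MySQL table."""
    root = ET.Element("dataConfig")
    # One <dataSource> per database; this sketch handles a single database.
    ET.SubElement(root, "dataSource", {
        "type": "JdbcDataSource",
        "driver": "com.mysql.jdbc.Driver",
        "url": db_url,
        "user": user,
        "password": password,
    })
    document = ET.SubElement(root, "document")
    for table in tables:
        # One <entity> per table; columns map to same-named Solr fields.
        ET.SubElement(document, "entity", {
            "name": table,
            "query": "SELECT * FROM {}".format(table),
        })
    return ET.tostring(root, encoding="unicode")


# Hypothetical connection details and table names.
xml_text = build_data_config(
    "jdbc:mysql://localhost/example_db",
    "solr_user",
    "change-me",
    ["births_example", "tax_lists_example"],
)
print(xml_text)
```

A real configuration would usually add `<field>` mappings inside each entity, for example to fill the `record_` fields. Those are left out here because they depend on your own schema.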
  • 41. schema.xml • FieldTypes, Fields, and CopyFields • FieldTypes give indexing and querying instructions to “buckets” • Fields say what's what and whether to make something facetable or not • CopyFields collect Fields together into extra FieldTypes
  • 42. schema.xml - FieldTypes • 5 Custom FieldTypes (so far): – givenname – surname – surname_bmpm (phonetic) – place (note: not merely town) – year (which we're treating as text right now)
  • 46. schema.xml - Fields • Uppercase field names come from the MySQL column names • Examples: – Year – SchoolYear – Surname – FathersTown – MothersFathersGivenName – MaternalGrandfathersGivenName
  • 47. schema.xml - Fields • Lowercase fields are added as the data is imported into Solr, and start with the prefix record_ • Examples: – record_type (birth, death, tax, whatever) – record_source (name of repository) – record_latlong (latitude,longitude) – record_id (required!)
  • 48. schema.xml - Fields • You do not have to explicitly define every Field. • If something is imported that is not named and defined in schema.xml it will just be indexed as a straight-up text string, with nothing done to it. • Which is fine. • But IMHO it's better to define everything anyway so you can remember what's what and what you are doing to it.
  • 51. Add-ons and nice-to-haves (for the back-end) • Wildcards, and lots of 'em • Non-name words handled through stopwords.txt • Nicknames and name synonyms handled through synonyms.txt • Two files included: – synonyms_-_american-anglo-saxon.txt – synonyms_-_polish-ukrainian-jewish.txt • Should be based on your data and your historical/ethnic community standards
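As a rough illustration of what a synonyms file does at search time, the Python sketch below parses comma-separated synonym groups (the basic "a, b, c" line format that Solr's synonyms.txt accepts) and expands a given name into every name in its group. The sample lines are hypothetical and are not copied from the two files that ship with LeafSeek. Solr does this work itself inside its analysis chain, so the code only mimics the idea.

```python
def load_synonyms(lines):
    """Parse comma-separated synonym groups into a name -> group lookup.

    Only the "a, b, c" equivalence form is handled. Lines starting with "#"
    are comments, and "=>" mapping rules are skipped in this sketch.
    """
    lookup = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=>" in line:
            continue
        group = {name.strip().lower() for name in line.split(",") if name.strip()}
        for name in group:
            lookup.setdefault(name, set()).update(group)
    return lookup


def expand(name, lookup):
    """Return the sorted list of names to search for, including the input."""
    key = name.strip().lower()
    return sorted(lookup.get(key, {key}))


# Hypothetical sample entries for illustration only.
sample_lines = [
    "# given-name nicknames",
    "william, bill, will",
    "elizabeth, betty, liz",
]
lookup = load_synonyms(sample_lines)
print(expand("Bill", lookup))    # ['bill', 'will', 'william']
print(expand("Brooke", lookup))  # ['brooke']
```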
  • 54. More add-ons and nice-to-haves (for the back-end) • Translate your site into different languages – multilingual content deserves a real multilingual website – Pass user preferences through GET value or through Accept-Language header or read from a cookie or whatever you want • Built-in performance monitoring hooks for New Relic • Soundalike searches for surname variants – Levenshtein distance – “Regular” Soundex, Metaphone, Caverphone, etc.
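For readers who have not met these soundalike measures, here is a small, self-contained Python sketch of two of them: Levenshtein edit distance and classic American Soundex. It is meant only to show what the techniques compute. Solr has its own phonetic and fuzzy-matching components, and Beider-Morse Phonetic Matching is a different, more elaborate algorithm that is not implemented here.

```python
def levenshtein(a, b):
    """Count the single-character edits needed to turn string a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def soundex(name):
    """Return the four-character American Soundex code for a name."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for letter in letters:
            codes[letter] = digit
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous_code = codes.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate letters with the same code
        code = codes.get(letter, "")
        if code and code != previous_code:
            result += code
        previous_code = code  # vowels reset the code, allowing repeats
    return (result + "000")[:4]


print(levenshtein("Ganz", "Gans"))           # 1
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```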
  • 55. This is the part where I tell the story about THE SAGA of Beider-Morse Phonetic Matching (BMPM)
  • 56. Relevancy • Right now, we're using exact matches • (Of course, “exact” includes wildcards, alternate names / synonyms, etc.) • Like “Old Search” on Ancestry.com • DisMax! Boosting fields! Scoring! • (…but not yet) • Problems with records with multiple people's names in the record
  • 59. Lots of Front-End Options • Ruby: Sunspot, RSolr, Tanning Bed, acts-as-solr • Django/Python: Haystack, Sunburnt, solrpy, pysolr • Older PHP options: PECL, solr-php-client • Plugins for blog/CMS systems: Drupal, WordPress
  • 60. Meet Solarium • http://www.solarium-project.org/ • New, open source PHP wrapper for Solr • Very active development • Version 2.4 coming soon
  • 64. Meet Solarium: The Guts • You choose the parts of your data to facet • Data is submitted to the front-end by POST, not by GET, so the URL never changes • You can (and should) paginate results listings • You can't actually see the Solr server's URL from the front-end, not even in view-source
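Behind the scenes, faceting and pagination come down to a handful of standard Solr query parameters (`facet`, `facet.field`, `start`, `rows`). The Python sketch below builds such a query string. It is not Solarium's PHP API, and the function name, field names, and page size are hypothetical; it only shows roughly what a server-side wrapper ends up sending to Solr on the user's behalf.

```python
from urllib.parse import urlencode


def build_select_params(query, page, per_page, facet_fields):
    """Build the query string for a paginated, faceted Solr select request."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must both be at least 1")
    params = [
        ("q", query),
        ("start", str((page - 1) * per_page)),  # offset of the first result
        ("rows", str(per_page)),                # results per page
        ("wt", "json"),
    ]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field is repeated once for each field to facet on.
        params.extend(("facet.field", field) for field in facet_fields)
    return urlencode(params)


# Hypothetical query: page 3 of a surname search, 25 results per page.
query_string = build_select_params(
    "Surname:Example", 3, 25, ["record_type", "Year"]
)
print(query_string)
```

Because the front-end server makes this request itself, the Solr URL and its parameters never have to appear in the visitor's browser.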
  • 65. Add-ons and nice-to-haves (for the front-end) • A welcome screen with information about the database's contents • Instructions (maybe twice) • How many records in the database? • How many datasets? • What features are coming next? • What datasets are coming next?
  • 66. Add-ons and nice-to-haves (for the front-end) • Make good UI choices • Pop-Up Google Maps • Tooltips to reduce UI clutter • Cross-browser compatibility • Still stuck with IE 7 and 8 • CSS and code that degrades gracefully • No small text
  • 67. Bird's Eye View of Your Data • What (surnames, towns, etc.) do I have in my data? • What are the TOP (surnames, towns, etc.) in my data? • Finding incorrect data – Outlying years and dates – Figure out that hard-to-read surname • Make charts and graphs from your data
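The same kind of overview can be sketched outside Solr with a few lines of Python. The example below counts the most common values of a field and flags records whose year is missing, non-numeric, or outside an expected range. The records and the 1780 to 1945 range are hypothetical placeholders, and in a Solr setup the counting would more naturally come from facet counts.

```python
from collections import Counter


def top_values(records, field, limit=3):
    """Return the most common non-empty values of a field, with counts."""
    counts = Counter(r[field] for r in records if r.get(field))
    return counts.most_common(limit)


def outlying_years(records, earliest, latest):
    """Return records whose Year is missing, non-numeric, or out of range."""
    flagged = []
    for record in records:
        year = str(record.get("Year", "")).strip()
        if not year.isdigit() or not earliest <= int(year) <= latest:
            flagged.append(record)
    return flagged


# Hypothetical placeholder records for illustration only.
records = [
    {"Surname": "Example", "Year": "1885"},
    {"Surname": "Example", "Year": "1890"},
    {"Surname": "Sample", "Year": "1089"},  # likely a typo for 1889
    {"Surname": "Placeholder", "Year": ""},
]
print(top_values(records, "Surname"))
print(outlying_years(records, 1780, 1945))
```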
  • 69. The (Back-End) Future! (Maybe.) • Date ranges, instead of just years • Auto-complete as you type • “Did you mean...?” (based on data frequency) • “More Like This” (would have to do scoring) • Record bookmarking system (hashes?)
  • 70. The (Front-End) Future! (Maybe.) • Hierarchical facets for locations • Disambiguating locations • Social sharing of individual records • New genealogy data schema http://historical-data.org/ • Membership login system
  • 72. Please Do Not Build That Wall • Password protect some of the databases • Password protect some of the data • Open data, but pay for record or surname bookmarking system • Open data, but pay for API access • Open data, but sell online ads • Open data, but give people guilt trips
  • 73. Presenting LeafSeek! • Free and Open Source • Code is all on GitHub • Please add, edit, fix, change, tinker • …and use it!
  • 76. Why is this FREE? And why is this important?