A mini-Google for a private phpBB forum

Background
  • Recently joined a non-technical forum
  • Set up by a friend of the founder, who hasn’t been seen for 1-2 years
  • Running on shared hosting
  • Search limited to the past year’s content to keep things fast

3 steps
  • Mirror the site locally
  • Insert the content in a database
  • Release Sphinx

The Problems
  • It’s private
  • It’s not well formed
  • Page info is contained in the query string: viewtopic.php?f=30&t=10170
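
Because the page identity lives in the query string rather than in the path, a mirroring script has to parse it to know which forum and topic a link points to. A minimal sketch in Python, assuming `f` is the forum id and `t` the topic id as the example URL suggests (the function name `topic_ids` is made up for this illustration):

```python
from urllib.parse import urlsplit, parse_qs


def topic_ids(url):
    """Return (forum_id, topic_id) for a viewtopic.php URL, or None.

    Assumes 'f' is the forum id and 't' the topic id, as in the
    example URL on the slide.
    """
    parts = urlsplit(url)
    if not parts.path.endswith("viewtopic.php"):
        return None
    query = parse_qs(parts.query)
    try:
        return int(query["f"][0]), int(query["t"][0])
    except (KeyError, ValueError):
        # Missing or non-numeric parameter: not a usable topic link.
        return None


print(topic_ids("viewtopic.php?f=30&t=10170"))  # → (30, 10170)
```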

Wget
  • Old faithful: the first tool I reach for when I want to download anything from a website
  • Simple for simple tasks, but with the flexibility to handle more complex ones

Wget – Logging in
  • Ability to import a Netscape-style cookies file
  • Didn’t work for me
      • Wrong format?
      • Wrongly configured wget?
      • phpBB checking user agent / IP address?
  • Log in directly in wget
      • Multi-step process
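
The slides do not show the login steps themselves. As a rough illustration of what a multi-step login involves (post the credentials once, keep the session cookie, reuse it for every later request), here is a sketch using Python's standard library rather than wget. The login path `ucp.php?mode=login` and the form fields `username`, `password` and `login` are assumptions based on a typical phpBB install, not taken from the talk, and the default user agent string is a placeholder.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_opener():
    """Build an opener that remembers cookies between requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def build_login_request(base_url, username, password,
                        user_agent="Mozilla/5.0"):
    """Prepare (but do not send) the login POST.

    ASSUMPTION: the path and field names below match a typical phpBB
    login form; check the real form before relying on them.
    """
    data = urllib.parse.urlencode({
        "username": username,
        "password": password,
        "login": "Login",
    }).encode("ascii")
    return urllib.request.Request(
        base_url.rstrip("/") + "/ucp.php?mode=login",
        data=data,
        # A browser-like user agent, in case the forum checks it.
        headers={"User-Agent": user_agent},
    )
```

Step one would be `opener.open(build_login_request(...))`. Every later `opener.open(page_url)` then sends the stored session cookie automatically.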

Wget – are we there yet?
  • There is massive redundancy in the link structure
  • Every post has an individual link which pulls in the entire topic
  • Can’t exclude URLs based on the query string
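
One way around this redundancy in a custom crawler is to reduce every link to a canonical topic URL before queueing it, so that per-post links collapse into the topic they belong to. The sketch below is an illustration, not the author's code. It assumes per-post links carry a `p` parameter, sessions a `sid` parameter and pagination a `start` parameter, which is typical of phpBB but not stated in the slides.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Parameters that identify a distinct page of a topic (assumed names).
KEEP = ("f", "t", "start")


def canonical_topic_url(url):
    """Strip per-post and session noise from a viewtopic.php link.

    Returns None for links that are not topic pages or that carry no
    topic id, so the caller can skip them.
    """
    parts = urlsplit(url)
    if not parts.path.endswith("viewtopic.php"):
        return None
    params = dict(parse_qsl(parts.query))
    if "t" not in params:
        return None
    kept = [(key, params[key]) for key in KEEP if key in params]
    return parts._replace(query=urlencode(kept), fragment="").geturl()


def unique_topic_urls(links):
    """Return canonical topic URLs in first-seen order, without repeats."""
    seen = set()
    result = []
    for link in links:
        canon = canonical_topic_url(link)
        if canon is not None and canon not in seen:
            seen.add(canon)
            result.append(canon)
    return result
```

With this filter, a topic page is fetched once per page of results instead of once per post.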

Zend_Http
  • Most of my time is spent with PHP and, more recently, Zend Framework
  • There is a lot to be said for using a tool you are familiar with

Scraping HTML
  • First needed to correct errors
      • Tidy extension
  • SimpleXML
      • Need to change xmlns to ns
      • Still doesn’t work in all cases
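
The slides do not say why `xmlns` had to be renamed. A likely reason is that a default namespace on the root element makes un-prefixed path queries match nothing. The sketch below reproduces that effect with Python's `xml.etree.ElementTree` standing in for PHP's SimpleXML. The markup and the `postbody` class are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Tidied XHTML typically declares a default namespace on <html>.
XHTML = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><div class="postbody">Hello forum</div></body>'
    '</html>'
)

root = ET.fromstring(XHTML)

# 1. A plain path finds nothing: every tag is really
#    '{http://www.w3.org/1999/xhtml}body', and so on.
naive = root.find("body/div")

# 2. The namespace-aware query works.
ns = {"x": "http://www.w3.org/1999/xhtml"}
aware = root.find("x:body/x:div", ns)

# 3. The workaround from the slide: rename the attribute so the parser
#    no longer sees a namespace declaration at all.
stripped = ET.fromstring(XHTML.replace(' xmlns="', ' ns="', 1))
renamed = stripped.find("body/div")

print(naive, aware.text, renamed.text)
```

The rename is a blunt text substitution, which may be part of why it "still doesn’t work in all cases". Registering the namespace, as in step 2, is the more robust route where the library supports it.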

Releasing Sphinx
  • Ridiculously simple
  • Simple config file adapted from the example
  • Runs for ~30s for ~90k posts
  • Added a simple database query to beef up the web interface from the example
  • Only downside: memory footprint
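
The extra database query is presumably needed because a Sphinx search returns matching document ids (plus any indexed attributes) rather than the full post, so the web interface has to look the posts up afterwards. A minimal sketch of that lookup, using an in-memory SQLite database in place of the forum's real database. The `posts` table and its columns are hypothetical, and the id list stands in for what the search would return.

```python
import sqlite3


def fetch_posts(conn, post_ids):
    """Load posts for the given ids, keeping the search-result order."""
    if not post_ids:
        return []
    placeholders = ",".join("?" for _ in post_ids)
    rows = conn.execute(
        "SELECT post_id, subject, author FROM posts "
        f"WHERE post_id IN ({placeholders})",
        list(post_ids),
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # IN (...) does not preserve ranking, so re-order by the id list.
    return [by_id[i] for i in post_ids if i in by_id]


# Demo data (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (post_id INTEGER PRIMARY KEY, subject TEXT, author TEXT)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [(1, "Welcome", "ann"), (2, "Rules", "bob"), (3, "Meetup", "cat")],
)

# Ids in relevance order, as a search engine might return them.
print(fetch_posts(conn, [3, 1]))
```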

Future tasks
  • Keep index updated
      • Implemented but could be more efficient
  • Exercise to learn Python

Dedicated search for a private phpBB forum using Sphinx
