Find Me a Roof !
project for “Gestione dell’informazione sul Web” class
                    AA 2009-2010
Alessandro Manfredi & Marco Bontempi & Marco Giannone
     {a.n0on3,bontempi,marco.giannone}@gmail.com
Goals
✓ Build a vertical search engine for real estate
  advertisements.

✓ Index and link information from multiple sources.
✓ Design it so that adding new sources is easy.
✓ Enrich sparse information by integrating web
  services.

✓ Provide a user-friendly interface for efficient searches,
  localized and selective on domain fields.

✓ “Did you mean ... ?” and search suggestions.
✓ Deploy on Amazon EC2/S3.
Preview
Preview ( autocomplete )
Preview ( results )
Preview ( did you mean ... ? )
What we used
Back End Overview
[Diagram components: roof bots, url repository, Download & Dispatch, Extractor 1 ... Extractor n, DB, LUCENE Indexes (Main, SpellChecker, AutoCompleter)]
                     Why the DB ?
               will be explained later ...
Crawling
• Collecting information from
  • www.trova-casa.net
  • www.immobiliare.it
• First attempt on trova-casa.net:
  • multithreaded brute force over same-structured
    URLs: after 75k ...
 • ... we got banned :-)
Crawling

• WebSphinx ( Carnegie Mellon University )
   • http://www-2.cs.cmu.edu/~rcm/websphinx/

• Timeout: 1s
• Scope limited to Rome and
   surroundings

   • Regex on URLs to visit and save
   • Coordinate filtering
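
The two scope filters above could look roughly like the sketch below. The URL pattern, the host name and the bounding-box values are all hypothetical, since the slides do not give the real ones.

```python
import re

# Hypothetical pattern: only ad pages under a "roma" path are saved.
AD_URL = re.compile(r"^https?://www\.example-realty\.test/roma/annuncio-(\d+)\.html$")

# Hypothetical bounding box around Rome (degrees).
SOUTH, NORTH, WEST, EAST = 41.6, 42.2, 12.2, 12.9


def should_save(url):
    """Return True when the URL looks like an ad page inside the scope."""
    return AD_URL.match(url) is not None


def in_bounding_box(lat, lon, south=SOUTH, north=NORTH, west=WEST, east=EAST):
    """Return True when the point lies strictly inside the box."""
    return south < lat < north and west < lon < east
```

A crawler would call `should_save` on every discovered link and `in_bounding_box` on every ad whose coordinates are known.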
Crawling
• For some reason WebSphinx stopped before reaching
  all of the real estate ads...

• We wrote a simple PHP roofbot that:
  • Starts from sitemaps
  • Reaches the index pages
  • Collects URLs along given navigation paths
• This way we reached all of the ~87k ads
  available in Rome and surroundings.
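
The sitemap step can be sketched as follows. This is an illustration in Python, not the original PHP roofbot. It assumes a standard sitemaps.org XML file, and the filter pattern passed by the caller is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return every <loc> URL found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def collect_ad_urls(xml_text, pattern):
    """Keep only the sitemap URLs that match the given regex."""
    rx = re.compile(pattern)
    return [url for url in urls_from_sitemap(xml_text) if rx.search(url)]
```

Downloading the sitemap and following the index pages is left out; only the URL selection is shown.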
Data Extraction
• HtmlUnit + Neko

• JTidy + XPath
  ( although #562127 (JTidy) forced us to skip a few fields )

• Information collected:
  • Data ( realty type, contract type, address,
    surface, price, coordinates, contacts )
  • Text ( description )
• Data was cleaned with regexes
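
A regex cleaning step of this kind might look like the sketch below. The input formats are assumptions (Italian-style integers with "." as thousands separator); the slides do not show the actual raw values or patterns.

```python
import re


def parse_number(raw):
    """Extract the first integer from strings such as '€ 250.000' or '85 mq'.

    Dots are treated as thousands separators (assumed Italian formatting).
    Returns None when the string contains no digits.
    """
    match = re.search(r"\d[\d.]*", raw)
    if match is None:
        return None
    return int(match.group(0).replace(".", ""))
```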
Data Enrichment
• Using the Google Maps API and web services
   • Adding coordinates from the address
       • Geocoding WS with CSV output:
         http://maps.google.com/maps/geo?output=csv&sensor=false&q=...

   • Adding the address from coordinates
       • API Geocoding WS, max 2,500 requests / day:
         http://maps.google.com/maps/api/geocode/xml?sensor=false&latlng=...

• This works for 83% of the requests performed.
   • E.g. it fails when street numbers are unknown to Google
       or when street names are misspelled.
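
As a rough sketch of the address-to-coordinates step, the code below builds the request URL quoted on the slide and parses one CSV reply line. The field order `status,accuracy,latitude,longitude` and the success code `200` are assumptions about that web service, and no network call is made here.

```python
from urllib.parse import urlencode

GEO_ENDPOINT = "http://maps.google.com/maps/geo"  # endpoint quoted on the slide


def geocode_url(address):
    """Build the CSV geocoding request URL for an address."""
    query = urlencode({"output": "csv", "sensor": "false", "q": address})
    return GEO_ENDPOINT + "?" + query


def parse_csv_reply(line):
    """Parse 'status,accuracy,lat,lng' (assumed order); None on failure."""
    parts = line.strip().split(",")
    if len(parts) != 4 or parts[0] != "200":
        return None
    return float(parts[2]), float(parts[3])
```

Ads for which `parse_csv_reply` returns `None` would simply keep their original, unenriched fields.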
Text search
• While the user is typing, JavaScript queries the
  AutoCompleter index to offer suggestions.

• The Main index is used for search
  • If fewer results than a threshold are
    returned, or if the highest score is too
    low, the SpellChecker index is invoked to
    guess possible spelling errors, and results
    for the deduced correct query are also
    displayed.
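
The fallback rule above can be sketched as follows. The two thresholds are made-up values, and `difflib` from the Python standard library stands in for the Lucene SpellChecker index, which works differently.

```python
import difflib

MIN_RESULTS = 3      # assumed threshold on the number of hits
MIN_TOP_SCORE = 0.5  # assumed threshold on the best score


def needs_spellcheck(hits):
    """hits: list of (doc_id, score) pairs sorted by descending score."""
    return len(hits) < MIN_RESULTS or hits[0][1] < MIN_TOP_SCORE


def did_you_mean(query, vocabulary):
    """Return the closest known term, or None when nothing is similar."""
    matches = difflib.get_close_matches(query, vocabulary, n=1)
    return matches[0] if matches else None
```

When `needs_spellcheck` is true and `did_you_mean` returns a term, the search is run again with that term and both result sets are shown.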
Suggestions

• In practice, since the AutoCompleter index often
  returned results for negligible words and
  does not support phrase queries,
  we returned suggestions by searching a
  list of common locations and keywords.

• In production, this list could be fed with
  the most common searches.
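
A list-based suggester of this kind could be as simple as a prefix lookup on a sorted list, which also handles multi-word phrases. The sketch below is an assumption about how it might be done, not the project's code.

```python
import bisect


def suggest(prefix, sorted_terms, limit=5):
    """Return up to `limit` terms starting with `prefix`.

    `sorted_terms` must be a sorted list of lowercase strings.
    """
    prefix = prefix.lower()
    start = bisect.bisect_left(sorted_terms, prefix)
    found = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix) or len(found) == limit:
            break
        found.append(term)
    return found
```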
Why use a DB ?
        • To take advantage of indexes for
          efficient range searches for data
          analysis.
        • E.g. provide the average price per unit of
          surface in a location with a selectable range.
        • Chance to delegate filtering to the

[Diagram components: QUERY, LUCENE Main Index, DB, ID-based Merge, Results]
An Example
SELECT avg("Prezzo"/"Superficie") FROM "Annunci"
WHERE "Contratto" = 'Vendita'
AND "Latitudine" < X AND "Latitudine" > Y
AND "Longitudine" > Z AND "Longitudine" < W
AND "Superficie" != 0 AND "Prezzo" != 0 ;
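
To try the query, the sketch below runs it against an in-memory SQLite table. SQLite, the column types and the sample rows are assumptions made for illustration; the slides do not say which database engine or schema the project used. X, Y, Z and W become the north, south, west and east bounds.

```python
import sqlite3

QUERY = """
SELECT avg("Prezzo"/"Superficie") FROM "Annunci"
WHERE "Contratto" = 'Vendita'
AND "Latitudine" < ? AND "Latitudine" > ?
AND "Longitudine" > ? AND "Longitudine" < ?
AND "Superficie" != 0 AND "Prezzo" != 0
"""


def make_db(rows):
    """Create an in-memory table with an assumed schema and load the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE "Annunci" ("Contratto" TEXT, "Prezzo" REAL, '
        '"Superficie" REAL, "Latitudine" REAL, "Longitudine" REAL)'
    )
    conn.executemany('INSERT INTO "Annunci" VALUES (?, ?, ?, ?, ?)', rows)
    return conn


def avg_price_per_unit(conn, south, north, west, east):
    """Average sale price per surface unit inside the bounding box."""
    return conn.execute(QUERY, (north, south, west, east)).fetchone()[0]
```

`avg_price_per_unit` returns `None` when no ad matches, because SQL `avg` over an empty set is NULL.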
The current implementation
 • Filtering is performed at application level
   over the Lucene main index results
 • The database is used for data analysis
[Diagram components: QUERY, LUCENE Main Index, Data Analysis, DB, Merge, Results]
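
A minimal sketch of this flow, assuming each hit from the main index is a dictionary with hypothetical `id`, `contract` and `price` fields, and that the database analysis has already produced a value per ad ID:

```python
def filter_hits(hits, contract=None, max_price=None):
    """Application-level filtering over the hits returned by the index."""
    kept = []
    for hit in hits:
        if contract is not None and hit["contract"] != contract:
            continue
        if max_price is not None and hit["price"] > max_price:
            continue
        kept.append(hit)
    return kept


def merge_with_analysis(hits, local_avg_by_id):
    """Attach the database analysis value to each hit, matching on ID."""
    return [dict(hit, local_avg=local_avg_by_id.get(hit["id"])) for hit in hits]
```

Hits without an analysis value get `local_avg=None`, so the result list keeps the ranking order of the index.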
Data Analysis
• Right now, limited to comparing each ad
  with the local price per unit of surface.
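
One way to express that comparison is a percentage difference between the ad's own price per surface unit and the local average. The function name and the choice of a percentage are assumptions; the slides do not say how the comparison is presented.

```python
def price_vs_local(price, surface, local_avg):
    """Percent difference between the ad's unit price and the local average.

    Positive means more expensive than the local average.
    Returns None when the surface or the average is missing or zero.
    """
    if not surface or not local_avg:
        return None
    unit_price = price / surface
    return (unit_price - local_avg) * 100.0 / local_avg
```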
Geolocation




• Users can navigate the map to select their
  area of interest and filter out ads
  located outside it, even if they match the
  query.
Deploy on AWS


• Launch and configure an EC2 instance starting from a
  community-provided “Debian” Linux AMI ( Amazon
  Machine Image )

• Save the instance to S3 to preserve the
  filesystem:
  •   ec2-bundle-vol -k <KEY> -c <CERT> -u <USER-ID> --destination /mnt --exclude /mnt

  •   ec2-upload-bundle -b <S3-bucket-name> -m /mnt/image.manifest.xml -a <ACCESS-KEY> -s <SECRET-KEY>

  •   ec2-register <S3-bucket-name>/image.manifest.xml -n <AMI-NAME> -K <KEY> -C <CERT>
Find Me a Roof !
                      ( we won’t let you live under a bridge )




                  Thanks


