FROM LEGACY, TO BATCH, TO NEAR REAL-TIME
Marc Sturlese, Dani Solà
WHO ARE WE?

• Marc Sturlese - @sturlese

    • Backend engineer, focused on R&D

    • Interests: search, scalability

• Dani Solà - @dani_sola

    • Backend engineer

    • Interests: distributed systems, data mining, search,...
TROVIT
Search engine for classifieds: 6 verticals, 38 countries & growing
FROM LEGACY TO BATCH

• Old architecture

• Why & when we changed

• Current architecture

• Hive, Pig & custom tools

• Migration process
OLD ARCHITECTURE

• Based on MySQL and PHP scripts

• Indexes created with DataImportHandler

[Diagram: Incoming data -> PHP Scripts -> MySQL -> DataImportHandler -> Lucene Indexes]
WHEN & WHY WE MOVED

• Sharded strategies are hard to maintain

• We had 10M rows in a single table

• Many processes working on MySQL databases

• We wanted a more maintainable codebase

• The solution was pretty obvious...
CURRENT ARCHITECTURE

• Based on Hadoop

• Batch process that reprocesses all the ads...

• But needs to be aware of the previous execution!

• Hive & custom tools to know what happens
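CURRENT ARCHITECTURE

[Diagram: Incoming data and External Data enter a pipeline running on the Hadoop cluster: Ad Processor -> Diff -> Matching -> Expiration -> Deduplication -> Indexing, which produces the Lucene Indexes for Deployment. The Diff phase also reads the previous execution (t-1). Hive Stats sits alongside the pipeline.]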
AD PROCESSOR

• Converts text files to Thrift objects

• Checks that the ads are complete

• Searches for poison words

• Checks the value ranges

• Parses text (dates, currencies, etc.)

[Diagram: Incoming data -> Ad Processor -> Thrift Objects]
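The checks listed on this slide can be sketched roughly as follows. This is a minimal illustration only, not Trovit's code: the tab-separated input format, the field names, the poison-word list and the price range are all assumptions, and a plain dict stands in for the Thrift object.

```python
from datetime import datetime

# Hypothetical configuration; the real rules are not given in the slides.
REQUIRED_FIELDS = ("id", "title", "price", "date")
POISON_WORDS = {"viagra", "casino"}
PRICE_RANGE = (1, 10_000_000)


def process_ad(raw_line):
    """Turn one tab-separated text line into a validated ad dict, or None."""
    parts = raw_line.rstrip("\n").split("\t")
    if len(parts) != len(REQUIRED_FIELDS):
        return None  # incomplete ad: wrong number of fields
    ad = dict(zip(REQUIRED_FIELDS, parts))
    if any(not value.strip() for value in ad.values()):
        return None  # incomplete ad: a required field is empty
    if set(ad["title"].lower().split()) & POISON_WORDS:
        return None  # title contains a poison word
    try:
        ad["price"] = float(ad["price"])  # parse the numeric text
        ad["date"] = datetime.strptime(ad["date"], "%Y-%m-%d").date()
    except ValueError:
        return None  # unparseable price or date
    if not PRICE_RANGE[0] <= ad["price"] <= PRICE_RANGE[1]:
        return None  # value out of the accepted range
    return ad
```

For example, `process_ad("1\tNice flat\t1200\t2012-05-01")` returns a dict with a float price and a parsed date, while a line with a missing field or a poison word returns `None`.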
DIFF PHASE

• Performs the diff between executions

• Merges the ads of both executions

[Diagram: ads t and ads t-1 -> Diff -> ads t]
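A diff-and-merge step of this kind might look like the sketch below. It assumes each execution is a dict keyed by ad id, and that the state worth carrying over is a `first_seen` value; both the data layout and the field names (`seen`, `first_seen`) are hypothetical.

```python
def diff_and_merge(ads_now, ads_prev):
    """Compare two executions keyed by ad id and merge them.

    Returns (merged, report), where report lists the ids that are new,
    changed, unchanged, or missing in the current execution.
    """
    report = {"new": [], "changed": [], "unchanged": [], "missing": []}
    merged = {}
    for ad_id, ad in ads_now.items():
        previous = ads_prev.get(ad_id)
        if previous is None:
            report["new"].append(ad_id)
            merged[ad_id] = dict(ad, first_seen=ad["seen"])
        else:
            # Ignore the per-execution "seen" stamp when comparing content.
            same = all(previous.get(k) == v for k, v in ad.items() if k != "seen")
            report["unchanged" if same else "changed"].append(ad_id)
            # Keep state carried over from the previous execution.
            merged[ad_id] = dict(ad, first_seen=previous["first_seen"])
    for ad_id in ads_prev:
        if ad_id not in ads_now:
            report["missing"].append(ad_id)
    return merged, report
```

The report is the kind of control information a later phase could use, for instance to decide which ads are candidates for expiration.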
MATCHING PHASE

• Extracts semantic information:

  • Geographical information

  • Cars’ makes and models

  • Companies

  • ...

[Diagram: ads and External Data -> Matching -> enriched ads]
EXPIRATION PHASE

• Works as a filter

• Deletes:

  • Expired ads

  • Incorrect ads

[Diagram: ads -> Expiration -> ads to be indexed]
DEDUPLICATION PHASE

• Duplicates are a big issue for us

• You cannot compare N ads against each other

• Solution:

  • Use heuristics to create groups of possible duplicates

  • Compare all the ads of each group

[Diagram: ads -> Deduplication -> deduplicated ads]
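The two-step idea (group first, then compare only inside each group) can be illustrated as below. The grouping heuristic (same city, same price band) and the Jaccard similarity on title words are assumptions chosen for the example; the slides do not say which heuristics or comparison Trovit uses.

```python
from collections import defaultdict


def blocking_key(ad):
    """Hypothetical heuristic: same city and same price band may be duplicates."""
    return (ad["city"].lower(), round(ad["price"], -2))


def similarity(a, b):
    """Jaccard similarity between the title word sets of two ads."""
    wa, wb = set(a["title"].lower().split()), set(b["title"].lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def deduplicate(ads, threshold=0.8):
    """Keep one ad per cluster of near-identical ads within each group."""
    groups = defaultdict(list)
    for ad in ads:
        groups[blocking_key(ad)].append(ad)
    kept = []
    for group in groups.values():
        unique = []
        for ad in group:  # pairwise comparison only inside the group
            if all(similarity(ad, other) < threshold for other in unique):
                unique.append(ad)
        kept.extend(unique)
    return kept
```

With groups of bounded size, the number of comparisons grows with the number of groups rather than with the square of the number of ads. The price of this is that duplicates falling into different groups (for example, prices on either side of a band edge) are not detected.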
INDEXING PHASE

• Is actually done in two phases

• First we create micro indexes

  • We use Embedded Solr Server

• Then we merge them

  • Plain Lucene

[Diagram: ads -> Indexing -> Lucene Indexes]
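The build-then-merge pattern can be shown with a toy inverted index. This sketch does not use Solr or Lucene at all; plain dicts stand in for the micro indexes, and the function names are made up for the illustration.

```python
from collections import defaultdict


def build_micro_index(docs):
    """Build a small inverted index (term -> sorted doc ids) for one partition."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def merge_indexes(micro_indexes):
    """Merge several micro indexes into one by unioning their posting lists."""
    merged = defaultdict(set)
    for index in micro_indexes:
        for term, ids in index.items():
            merged[term].update(ids)
    return {term: sorted(ids) for term, ids in merged.items()}
```

Each micro index can be built independently (for instance, one per parallel task), and the merge is a separate, cheaper step over already-built structures.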
HIVE, PIG & CUSTOM TOOLS

• Critical:

  • To know what is going on (control info)

  • To debug

  • To prototype new processes

  • To understand your data

  • To create reports

[Image caption: grep, cat]
MIGRATION PROCESS

• Used Amazon EC2 to test different cluster configurations

• Kept both systems running in parallel for one month

• Switched to the new system gradually, one country at a time

• Then we moved the cluster to our own servers
FROM BATCH TO NEAR REAL-TIME

• Batch is not enough

• Storm for real-time data processing

• HBase for data storage

• Zookeeper for systems coordination

• Putting it all together

• Batch and NRT: mixed architecture
BATCH IS NOT ENOUGH

• Data processing with MapReduce scales well but takes time and adds latency

• Crunching documents in batch means waiting until everything is processed

• We want to show the user fresher results!
BATCH IS NOT ENOUGH

Storm + HBase + Zookeeper looks like a good fit!

[Diagram: batch side with MR pipeline, HDFS, Id tables, Solr and ZK; Storm topology with Feeds -> Spouts -> Bolts -> Bolts, plus ZK and Slaves]
STORM - PROPERTIES

• Distributed real-time computation system

• Fault tolerance

• Horizontal scalability

• Low latency

• Reliability
STORM - COMPONENTS

• Tuple

• Stream

• Spout

• Bolt

• Topology
STORM IN ACTION

[Diagram: Queue -> Topology (Spouts -> Bolts -> Bolts, connected by streams of tuples) -> DataStore]
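How the components relate can be mimicked in a few lines of plain Python. This is only a conceptual sketch using generators; it is not the Storm API, it runs in a single process, and the message format (`id|title`) is an assumption.

```python
def feed_spout(queue):
    """Spout: source of a stream; emits one tuple per queued message."""
    for message in queue:
        yield (message,)


def parse_bolt(stream):
    """Bolt: consumes tuples from a stream and emits transformed tuples."""
    for (message,) in stream:
        ad_id, title = message.split("|", 1)
        yield (ad_id, title.strip().lower())


def store_bolt(stream, datastore):
    """Bolt: terminal step that writes each tuple to the data store."""
    for ad_id, title in stream:
        datastore[ad_id] = title


def run_topology(queue, datastore):
    """Topology: the graph that wires spouts and bolts together."""
    store_bolt(parse_bolt(feed_spout(queue)), datastore)
```

Calling `run_topology(["1| Red Car "], store)` with an empty dict `store` leaves one normalized entry in it. In Storm itself the spouts and bolts run in parallel across the cluster and the stream is unbounded, which this sketch does not attempt to model.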
STORM - DAEMONS


• Nimbus

• Supervisors

• Workers
HBASE - PROPERTIES

• Distributed, sorted map datastore

• Automatic failover

• Rows are sorted

• Many columns per row

• Good Hadoop integration
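The "sorted map" property is what makes range scans over row keys cheap. The toy class below illustrates the idea in memory; it is not the HBase client API, and the class name and the `country#id` key layout are hypothetical.

```python
import bisect


class SortedRowStore:
    """Toy in-memory table: rows kept sorted by key, each row a dict of columns."""

    def __init__(self):
        self._keys = []   # row keys in sorted order
        self._rows = {}   # row key -> {column: value}

    def put(self, row_key, column, value):
        """Set one column of a row, creating the row in sorted position if needed."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Return all columns of a row, or an empty dict if it does not exist."""
        return self._rows.get(row_key, {})

    def scan(self, start, stop):
        """Yield (key, row) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]
```

Because keys are sorted, choosing a key prefix such as a country code lets a single scan return all rows for that prefix without touching the rest of the table.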
HBASE - COMPONENTS

• Master

 • Slave coordination and failure detection

 • Admin features

• Region servers (slaves)
ZOOKEEPER

• Highly available coordination system

• Used for locking, distributed configuration, leader election, cluster management...

• Curator makes it easy to use the common algorithms
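As an illustration of one of these algorithms, the usual leader-election approach has each participant register a node with an increasing sequence number, and the lowest number leads; when that participant's session ends, its node disappears and the next lowest takes over. The sketch below simulates this in memory only. It does not talk to ZooKeeper or use Curator, and the class and method names are invented for the example.

```python
import itertools


class ElectionRegistry:
    """In-memory stand-in for a node holding ephemeral sequential children."""

    def __init__(self):
        self._counter = itertools.count()
        self._members = {}  # sequence number -> participant name

    def join(self, name):
        """Register a participant; returns its sequence number."""
        seq = next(self._counter)
        self._members[seq] = name
        return seq

    def leave(self, seq):
        """Simulate a session ending: the participant's node disappears."""
        self._members.pop(seq, None)

    def leader(self):
        """The participant with the lowest sequence number leads."""
        if not self._members:
            return None
        return self._members[min(self._members)]
```

If `indexer-a` joins before `indexer-b`, then `leader()` returns `indexer-a` until it leaves, after which `indexer-b` takes over without any explicit hand-off.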
PUTTING IT ALL TOGETHER

[Diagram: batch side with MR pipeline, HDFS, Id tables, Solr and ZK; Storm topology with Feeds -> Spouts -> Bolts processor -> Bolt Indexer, plus ZK and Slaves]
MIXED ARCHITECTURE

• If the number of segments in the index gets too big, it has an impact on search performance

• Building indexes in batch keeps the number of segments small

• The mixed architecture gives near real-time updates and is tolerant of human error
THANK YOU!
  QUESTIONS?

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

From legacy, to batch, to near real-time

  • 1. FROM LEGACY, TO BATCH, TO NEAR REAL-TIME Marc Sturlese, Dani Solà
  • 2. WHO ARE WE? • Marc Sturlese - @sturlese • Backend engineer, focused on R&D • Interests: search, scalability • Dani Solà - @dani_sola • Backend engineer • Interests: distributed systems, data mining, search,...
  • 3. TROVIT Search engine for classifieds: 6 verticals, 38 countries & growing
  • 4.
  • 5. FROM LEGACY TO BATCH • Old architecture • Why & when we changed • Current architecture • Hive, Pig & custom tools • Migration process
  • 6. OLD ARCHITECTURE • Based on MySQL and PHP scripts • Indexes created with DataImportHandler Incoming data DataImportHandler Lucene Indexes MySQL PHP Scripts
  • 7. WHEN & WHY WE MOVED • Sharded strategies are hard to maintain • We had 10M rows in a single table • Many processes working on MySQL databases • We wanted a more maintainable codebase • The solution was pretty obvious...
  • 8. CURRENT ARCHITECTURE • Based on Hadoop • Batch process that reprocesses all the ads... • But it needs to be aware of the previous execution! • Hive & custom tools to know what happens
  • 9. CURRENT ARCHITECTURE Incoming data External Data Lucene Indexes Deployment Ad Processor Diff Matching Expiration Deduplication Indexing t-1 Hadoop Cluster Hive Stats
  • 10. AD PROCESSOR Incoming data • Converts text files to Thrift objects • Checks that the ads are complete • Searches for poisonwords Ad Processor • Checks the value ranges Thrift • Parses text (dates, currencies, etc) Objects
  • 11. DIFF PHASE ads t ads t-1 • Performs the diff between executions Diff • Merges the ads of both executions ads t
  • 12. MATCHING PHASE ads External Data • Extracts semantic information: • Geographical information • Cars’ makes and models Matching • Companies enriched • ... ads
  • 13. EXPIRATION PHASE ads • Works as a filter • Deletes: Expiration • Expired ads • Incorrect ads ads to be indexed
  • 14. DEDUPLICATION PHASE • Duplicates are a big issue for us • You cannot compare all N ads against each other • Solution: • Use heuristics to create groups of possible duplicates • Compare all the ads of each group (diagram: ads → Deduplication → deduplicated ads)
  • 15. INDEXING PHASE • Is actually done in two phases • First we create micro indexes • We use Embedded Solr Server • Then we merge them • Plain Lucene (diagram: ads → Expiration → Lucene Indexes)
  • 16. HIVE, PIG & CUSTOM TOOLS • Critical: • To know what is going on (control info) • To debug • To prototype new processes • To understand your data grep, cat • To create reports
  • 17. MIGRATION PROCESS • Used Amazon EC2 to test different cluster configurations • Kept both systems running in parallel for one month • Switched to the new system gradually, one country at a time • Then we moved the cluster to our own servers
  • 18. FROM BATCH TO NEAR REAL-TIME • Batch is not enough • Storm for real time data processing • HBase for data storage • Zookeeper for systems coordination • Putting it all together • Batch and NRT. Mixed architecture
  • 19. BATCH IS NOT ENOUGH • Data processing with MapReduce scales well but takes time and has latency • Crunching documents in batch means waiting until everything is processed • We want to show the user fresher results!
  • 20. BATCH IS NOT ENOUGH ZK MR pipeline HDFS Id tables Solr Storm + HBase + Zookeeper looks like a good fit! Topology ZK Feeds Spouts Bolts Bolts Slaves
  • 21. STORM - PROPERTIES • Distributed real time computation system • Fault tolerance • Horizontal scalability • Low latency • Reliability
  • 22. STORM - COMPONENTS • Tuple • Stream • Spout • Bolt • Topology
  • 23. STORM IN ACTION Spouts Bolts Bolts Streams of tuples Queue Topology DataStore
  • 24. STORM - DAEMONS • Nimbus • Supervisors • Workers
  • 25. HBASE - PROPERTIES • Distributed, sorted map datastore • Automatic failover • Rows are sorted • Many columns per row • Good Hadoop integration
  • 26. HBASE - COMPONENTS • Master • Slave coordination and failure detection • Admin features • Region server (slaves)
  • 27. ZOOKEEPER • Highly available coordination system • Used for locking, distributed configuration, leader election, cluster management... • Curator makes common algorithms easy to implement
  • 28. PUTTING IT ALL TOGETHER ZK MR pipeline HDFS Id tables Solr Topology ZK Feeds Spouts Bolts processor Bolt Indexer Slaves
  • 29. MIXED ARCHITECTURE • If the number of segments in the index gets too big, it has an impact on search performance • Building indexes in batch allows us to keep a small number of segments • Gives near real-time updates and is tolerant to human error
  • 30. THANK YOU! QUESTIONS?
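
The deduplication approach on slide 14 (use heuristics to build groups of possible duplicates, then compare only the ads inside each group) can be sketched roughly as below. This is a minimal single-machine illustration, not the Hadoop job from the talk: the ad fields (`id`, `title`, `city`, `price`), the grouping key, and the Jaccard threshold are all assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(ad):
    # Hypothetical heuristic: same city and same price rounded to the
    # nearest thousand. The talk does not say which heuristics are used.
    return (ad["city"].strip().lower(), round(ad["price"], -3))


def similar(a, b, threshold=0.8):
    # Jaccard similarity on title tokens (an assumed comparison function).
    ta = set(a["title"].lower().split())
    tb = set(b["title"].lower().split())
    if not ta or not tb:
        return False
    return len(ta & tb) / len(ta | tb) >= threshold


def deduplicate(ads):
    # Step 1: group ads that might be duplicates of each other.
    groups = defaultdict(list)
    for ad in ads:
        groups[blocking_key(ad)].append(ad)

    # Step 2: compare all the ads within each group, pair by pair.
    duplicates = set()
    for group in groups.values():
        for a, b in combinations(group, 2):
            if a["id"] in duplicates or b["id"] in duplicates:
                continue
            if similar(a, b):
                duplicates.add(b["id"])  # keep the first ad seen

    return [ad for ad in ads if ad["id"] not in duplicates]


ads = [
    {"id": 1, "title": "Sunny flat near the beach", "city": "Barcelona", "price": 120000},
    {"id": 2, "title": "Sunny flat near the beach", "city": "barcelona ", "price": 120400},
    {"id": 3, "title": "Large house with garden", "city": "Barcelona", "price": 120000},
    {"id": 4, "title": "Sunny flat near the beach", "city": "Madrid", "price": 120000},
]
unique_ids = [ad["id"] for ad in deduplicate(ads)]
```

Grouping first means the pairwise comparison runs only within each group instead of across all N ads. The trade-off is that duplicates landing in different groups (for instance, prices just either side of a rounding boundary) are never compared.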
