Python <3 Content systems
                          - managing millions of tracks for the masses




                                                                         22nd October 2012

Tuesday, October 23, 12
> 15 M active users*
> Available in 15 Countries
> 18 M tracks
> 20 k new tracks added per day
> 1 century of listening
> 500 M playlists
                          * Users active within the previous 30 days

Service overview

                          Storage
                          User
                                     AP
                          Search
                          Metadata
                          ...

Content pipeline

                          [Diagram: Label A, Label B, Label C, Label D -> Ingestion]

                          Image: Steve Jurvetson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/

Ingestion

                          [Slide graphic: the word "XML" repeated over a background image]

                          Background image: lord enfield (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/

Ingestion: Delivery formats

             ~ 10 different incoming XML formats
                     - Proprietary formats (majors)
                     - Spotify delivery format (mostly indies)
             Thousands of lines of source-specific code

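Why so much source-specific code is needed can be illustrated with a small sketch: one parser per delivery format, each producing the same internal record. The two XML layouts, the `Track` record and the parser names are hypothetical, invented for illustration; the talk does not show the real formats.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Track:
    """Common internal record that every source-specific parser produces."""
    title: str
    artist: str


def parse_major_format(xml_text):
    # Hypothetical proprietary layout: <release><song name=... performer=.../></release>
    root = ET.fromstring(xml_text)
    return [Track(s.get("name"), s.get("performer")) for s in root.iter("song")]


def parse_indie_format(xml_text):
    # Hypothetical layout: <album><track><title/><artist/></track></album>
    root = ET.fromstring(xml_text)
    return [Track(t.findtext("title"), t.findtext("artist"))
            for t in root.iter("track")]


# One entry per incoming format (two hypothetical ones here).
PARSERS = {
    "major": parse_major_format,
    "indie": parse_indie_format,
}


def ingest(source, xml_text):
    """Dispatch a delivery to the parser registered for its source."""
    return PARSERS[source](xml_text)


major_xml = '<release><song name="Kiss" performer="Prince"/></release>'
indie_xml = ('<album><track><title>Kiss</title>'
             '<artist>Prince</artist></track></album>')
tracks_from_major = ingest("major", major_xml)
tracks_from_indie = ingest("indie", indie_xml)
```

Both deliveries end up as the same `Track` record, which is what lets later pipeline stages ignore where the data came from.
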
Data model [simplified]

                          [Diagram: entities Artist, Album, Disc, Track, Audio, Transcoding and Rights, connected by relations with 1 and * multiplicities]

Ingestion

                          lxml and XSLT, with extensions, for
                          parsing and transforming XML

Ingestion: XPath extensions
     >>> from lxml import etree

     >>> def formerlify(_, name):
     ...     return 'The artist formerly known as %s' % name

     >>> # Register the function in a custom XPath function namespace
     >>> ns = etree.FunctionNamespace('http://my.org/myfunctions')
     >>> ns['formerlify'] = formerlify
     >>> ns.prefix = 'f'

     >>> root = etree.XML('<a><b>Prince</b></a>')
     >>> print(root.xpath('f:formerlify(string(b))'))
     The artist formerly known as Prince

                          http://lxml.de/extensions.html#xpath-extension-functions

Ingestion
          Fun (?!) fact: the largest XML file seen so far had 3.3 million lines, taking up
          350 MB of disk space

          For comparison, the Bible apparently fits in 3 MB of XML

                   >>> import timeit
                   >>> timeit.timeit('e.parse("huge.xml")',
                                     setup='import lxml.etree as e',
                                     number=5) / 5
                   4.19...

                   >>> timeit.timeit('e.parse("huge.xml")',
                                     setup='import xml.etree.cElementTree as e',
                                     number=5) / 5
                   4.78...

                   >>> timeit.timeit('e.parse("huge.xml")',
                                     setup='import xml.etree.ElementTree as e',
                                     number=5) / 5
                   55.39...

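Calling `parse()` on a file of that size builds the whole tree in memory. When only a running result is needed, a streaming parse is an alternative. The following is a minimal sketch using the standard library's `iterparse`; the `track` element name and the sample document are assumptions, and the talk does not say whether the ingestion code works this way.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source, tag):
    """Count elements named `tag` while reading the document incrementally."""
    count = 0
    # iterparse yields each element once its end tag has been read.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Drop the element's text and children once it has been handled.
            elem.clear()
    return count


# Small in-memory document standing in for a large delivery file.
sample = io.BytesIO(b"<tracks><track>a</track><track>b</track></tracks>")
n_tracks = count_elements(sample, "track")
```
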
Content pipeline

                          [Diagram: Label A, Label B, Label C, Label D -> Ingestion -> Merge]

Centralized vs. aggregated cataloging

          Requires humans!                    Requires merging!

Metadata - challenges




                          Image: Nicolas Genin (CC BY 2.0) http://www.flickr.com/photos/22785954@N08
Content pipeline

                          [Diagram: Label A, Label B, Label C, Label D -> Ingestion -> Merge, with Curation/enrichment]

Ambiguous artists - thesis work

    • User input
    • Machine learning
    • Matching against external sources
    • Feature selection (#matches per external
      source, len(name), country-count,
      multilingual)
    • Matching + preprocessing in Python

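The feature list above could translate into a feature vector along these lines. This is a hedged sketch, not the thesis code: the input shapes, the source names `source_a` and `source_b`, and the reading of "multilingual" (release titles that use more than one Unicode script) are all assumptions made for illustration.

```python
import unicodedata


def script_of(char):
    """Rough script label taken from the first word of the Unicode name."""
    # e.g. 'LATIN SMALL LETTER A' -> 'LATIN', 'CYRILLIC SMALL LETTER A' -> 'CYRILLIC'
    return unicodedata.name(char, "UNKNOWN").split(" ")[0]


def artist_features(name, matches_per_source, countries, titles):
    """Build a feature dict for one artist name.

    matches_per_source: hypothetical mapping {source name: number of matches}
    countries: country codes associated with the artist's releases
    titles: release titles, used for the assumed 'multilingual' feature
    """
    scripts = {script_of(c) for title in titles for c in title if c.isalpha()}
    features = {
        "len_name": len(name),
        "country_count": len(set(countries)),
        "multilingual": len(scripts) > 1,
    }
    for source, n_matches in matches_per_source.items():
        features["matches_" + source] = n_matches
    return features


features = artist_features(
    name="Nirvana",
    matches_per_source={"source_a": 3, "source_b": 2},
    countries=["US", "GB", "US"],
    titles=["Nevermind", "\u041f\u0435\u0441\u043d\u0438"],  # second title is Cyrillic
)
```
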
Content matching

                          (16 * 10 ** 6) ** 2 = 2.56 * 10 ** 14 (a large number)

 Reduce search space:
 >>> from unicodedata import normalize
 >>> key = ''.join(normalize('NFD', char)[0].lower() for char in title)[:5]

                                   Side note: Levenshtein (edit) distance is a heavy operation

                                   -> sped up about 4x with PyPy (or use a C extension)

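How a key like the one above reduces the search space can be sketched as follows: titles are grouped by key, and the expensive edit distance is only computed within a group. The five-character prefix, the distance threshold and the sample titles are assumptions for illustration, and the pure-Python `levenshtein` is a textbook implementation, not the code used in the talk.

```python
from collections import defaultdict
from itertools import combinations
from unicodedata import normalize


def blocking_key(title, length=5):
    """First `length` characters, lowercased, with diacritics stripped."""
    return ''.join(normalize('NFD', char)[0].lower() for char in title)[:length]


def levenshtein(a, b):
    """Edit distance by dynamic programming, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution
            ))
        previous = current
    return previous[-1]


def candidate_pairs(titles, max_distance=2):
    """Compare titles only within the same bucket instead of all pairs."""
    buckets = defaultdict(list)
    for title in titles:
        buckets[blocking_key(title)].append(title)
    pairs = []
    for bucket in buckets.values():
        for a, b in combinations(bucket, 2):
            if levenshtein(a.lower(), b.lower()) <= max_distance:
                pairs.append((a, b))
    return pairs


# "D\u00e9j\u00e0 Vu" is "Deja Vu" written with accented e and a.
titles = ["D\u00e9j\u00e0 Vu", "Deja vu", "Dejavu", "Purple Rain", "Purple Rainn"]
pairs = candidate_pairs(titles)
```

In this sample "Dejavu" gets the key `dejav` while the other two spellings get `deja ` (with a trailing space), so it is never compared with them. That kind of miss is the price paid for not comparing every title with every other title.
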
Automatic data processing will never be perfect

                          Patch it!

Content pipeline

                          [Diagram: Label A, Label B, Label C, Label D -> Ingestion -> Merge, with Curation/enrichment and Transcoding]

Transcoding



                          Asynchronous

                            RabbitMQ + amqplib

                Master / workers
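
The master/worker pattern can be sketched without a broker. The example below is a stand-in, not the setup from the talk: it uses the standard library's `queue.Queue` and threads where the slide names RabbitMQ with amqplib, and `fake_transcode` is a hypothetical placeholder for real audio transcoding.

```python
import queue
import threading


def fake_transcode(track_id, codec):
    # Hypothetical placeholder: a real worker would invoke an audio encoder here.
    return "%s.%s" % (track_id, codec)


def worker(jobs, results):
    """Consume jobs until the master sends the None sentinel."""
    while True:
        job = jobs.get()
        if job is None:
            break
        results.put(fake_transcode(*job))


def run_master(track_ids, codecs, n_workers=3):
    """Publish one job per (track, codec) pair and collect the results."""
    jobs, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    n_jobs = 0
    for track_id in track_ids:
        for codec in codecs:
            jobs.put((track_id, codec))
            n_jobs += 1
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return sorted(results.get() for _ in range(n_jobs))


outputs = run_master(["t1", "t2"], ["ogg", "mp3"])
```

The master never waits for an individual job; it only enqueues work and collects results, which is the asynchronous property the slide refers to.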


Content pipeline

                          [Diagram: Label A, Label B, Label C, Label D -> Ingestion -> Merge -> Indexing, with Curation/enrichment and Transcoding]

Index build

     • Nightly batch job on db dumps
     • Previously mostly Python, but now moved to Java for
             performance reasons
     • But still lots of Python helper scripts :)

Content pipeline

                          [Diagram: Label A, Label B, Label C, Label D -> Ingestion -> Merge -> Indexing -> Publishing -> On-site live services, e.g. search, browse; with Curation/enrichment and Transcoding]

Distribution/publish

                          [Diagram, built up over several slides: Index A, Index B and Index C; Service A, Service B and Service C]

Scheduling being migrated to ZooKeeper




                          image: http://www.flickr.com/photos/seattlemunicipalarchives/with/3797940791/

Distribution/publish

                             Staged rollout

Distribution/publish

                             Exponential back-off
                             waiting 5s ...
                             waiting 10s ...
                             waiting 30s ...
                             waiting 60s ...

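A retry loop with growing waits can be sketched as below. The delay schedule is taken from the waits shown on the slide (5, 10, 30, 60 seconds, so not a strict doubling); the `publish` callable, the injected `sleep` parameter and the flaky example step are assumptions for illustration.

```python
import time


def publish_with_backoff(publish, delays=(5, 10, 30, 60), sleep=time.sleep):
    """Call `publish` until it succeeds, waiting longer after each failure.

    Returns the result of the first successful call. Once the delay schedule
    is exhausted, one final attempt is made and its error propagates.
    """
    for delay in delays:
        try:
            return publish()
        except Exception:  # broad on purpose in this sketch
            print("waiting %ds ..." % delay)
            sleep(delay)
    return publish()


# Usage with a hypothetical publish step that fails twice, then succeeds.
attempts = []
waited = []


def flaky_publish():
    attempts.append(1)
    if len(attempts) < 3:
        raise IOError("target host not ready")
    return "published"


# Passing waited.append as `sleep` records the delays instead of sleeping.
result = publish_with_backoff(flaky_publish, sleep=waited.append)
```
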
Store ’da data



Choice of database

                    Depends on the use case - duh!
                    • PostgreSQL (e.g. user service)
                    • Cassandra (e.g. playlist service)
                    • Tokyo Cabinet (e.g. browse service)
                    • Lucene (search service)
                    • HDFS

PostgreSQL




                                                          [Pic. of elephant]




                          Image: http2007 (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/
PostgreSQL




                          Redundancy + scaling:
                          master/slave
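
A minimal sketch of how a client might use a master/slave setup, assuming writes go to the master and reads are spread round-robin over the slaves. The class name `MasterSlaveRouter` and the string targets are hypothetical and only illustrate the idea; the slides do not show how the routing is actually done.

```python
import itertools


class MasterSlaveRouter:
    """Pick a connection target: writes go to the master, reads rotate over slaves."""

    def __init__(self, master, slaves):
        self.master = master
        # Fall back to the master when no slaves are configured.
        self._readers = itertools.cycle(slaves or [master])

    def for_write(self):
        # All writes must hit the single master.
        return self.master

    def for_read(self):
        # Round-robin over the read replicas.
        return next(self._readers)
```

For example, `MasterSlaveRouter('db-master', ['db-slave1', 'db-slave2'])` hands out the two slaves alternately for reads. Reads served by a slave may lag slightly behind the master, so this only suits queries that tolerate stale data.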



PostgreSQL




                          Joins and subqueries -
                          let the query planner roll!


PostgreSQL


          Python?
                          - psycopg2 + SQL-queries
                          - SQLAlchemy migrator for
                          versioning of db-schemas

          Tip! Server-side (aka named) cursors:
             import psycopg2

             # Passing a name to cursor() creates a server-side (named) cursor,
             # so rows are fetched in batches instead of all at once.
             conn = psycopg2.connect(database='huge_db', user='postgres',
                                     password='secret')
             sscursor = conn.cursor('my_cursor')
             sscursor.execute('SELECT * FROM big_table')
             rows = sscursor.fetchmany(1000)
             ...
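
A sketch of what could follow the `fetchmany` call: keep fetching until the cursor is exhausted. The helper names `iter_batches` and `demo` are hypothetical, and sqlite3 stands in for PostgreSQL here only so the example runs without a database server. Any DB-API cursor with `fetchmany`, including a psycopg2 named cursor, should be usable the same way.

```python
import sqlite3


def iter_batches(cursor, batch_size=1000):
    """Yield lists of rows from a DB-API cursor until it is exhausted."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            return
        yield rows


def demo():
    # sqlite3 is an assumption for illustration; the slides use psycopg2.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE big_table (id INTEGER)")
    conn.executemany("INSERT INTO big_table VALUES (?)",
                     [(i,) for i in range(2500)])
    cursor = conn.cursor()
    cursor.execute("SELECT id FROM big_table ORDER BY id")
    # Process the result set batch by batch instead of loading it all.
    sizes = [len(batch) for batch in iter_batches(cursor, 1000)]
    conn.close()
    return sizes
```

With 2500 rows and a batch size of 1000, `demo()` sees two full batches and one partial batch.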


Scaling the content pipeline




                           What to scale for?
                           • Size of catalog
                           • Number of users


Thank you
                          henok@spotify.com




Distribution/publish




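                          Popen + gevent (although IO-bound)
                          import subprocess

                          import gevent
                          import gevent.monkey

                          gevent.monkey.patch_all()

                          def _wait(self):
                              # Poll instead of blocking, yielding to other greenlets between polls.
                              while True:
                                  res = self.poll()
                                  if res is not None:
                                      return res
                                  gevent.sleep(0.1)

                          # Replace the blocking Popen.wait with the cooperative version.
                          subprocess.Popen.wait = _wait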
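
The same polling idea can be sketched without monkey-patching, using only the standard library. `wait_polling` and `run_and_wait` are hypothetical names, and the injectable `sleep` parameter is an assumption for illustration: under gevent one would presumably pass `sleep=gevent.sleep` so that other greenlets run between polls.

```python
import subprocess
import sys
import time


def wait_polling(proc, interval=0.1, sleep=time.sleep):
    """Poll a Popen object until it exits, then return its exit code."""
    while True:
        code = proc.poll()
        if code is not None:
            return code
        # The sleep function is injectable so a cooperative one can be used.
        sleep(interval)


def run_and_wait(exit_code):
    # Start a short-lived child process that exits with the given code.
    proc = subprocess.Popen(
        [sys.executable, "-c", "import sys; sys.exit(%d)" % exit_code])
    return wait_polling(proc, interval=0.01)
```

Passing the sleep function in keeps the helper local to the caller, whereas assigning to `subprocess.Popen.wait` changes behaviour for every user of `Popen` in the process.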


Python &lt;3 Content systems

  • 1. Python <3 Content systems - managing millions of tracks for the masses 22nd October 2012 Tuesday, October 23, 12
  • 9. > 15 M active users* * Users active within the previous 30 days Tuesday, October 23, 12
  • 10. > Available in 15 Countries > 15 M active users* * Users active within the previous 30 days Tuesday, October 23, 12
  • 11. > 18 M tracks > Available in 15 Countries > 15 M active users* * Users active within the previous 30 days Tuesday, October 23, 12
  • 12. > 20 k new tracks added per day > 18 M tracks > Available in 15 Countries > 15 M active users* * Users active within the previous 30 days Tuesday, October 23, 12
  • 13. > 1 century of listening > 20 k new tracks added per day > 18 M tracks > Available in 15 Countries > 15 M active users* * Users active within the previous 30 days Tuesday, October 23, 12
  • 14. > 500 M playlists > 1 century of listening > 20 k new tracks added per day > 18 M tracks > Available in 15 Countries > 15 M active users* * Users active within the previous 30 days Tuesday, October 23, 12
  • 16. Service overview Storage Tuesday, October 23, 12
  • 17. Service overview Storage User Tuesday, October 23, 12
  • 18. Service overview Storage User Search Tuesday, October 23, 12
  • 19. Service overview Storage User Search Metadata Tuesday, October 23, 12
  • 20. Service overview Storage User Search Metadata . . . Tuesday, October 23, 12
  • 21. Service overview Storage User AP Search Metadata . . . Tuesday, October 23, 12
  • 22. Service overview Storage User AP Search Metadata . . . Tuesday, October 23, 12
  • 23. Service overview Storage User AP Search Metadata . . . Tuesday, October 23, 12
  • 24. Service overview Storage User AP Search Metadata . . . Tuesday, October 23, 12
  • 25. Content pipeline Label A Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 26. Content pipeline ti on e s Label A n g I Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 27. Ingestion XM L L M M LX MX X L Background image: lord enfield (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/ Tuesday, October 23, 12
  • 29. Ingestion: Delivery formats ~ 10 different incoming XML formats Tuesday, October 23, 12
  • 30. Ingestion: Delivery formats ~ 10 different incoming XML formats - Proprietary formats (majors) Tuesday, October 23, 12
  • 31. Ingestion: Delivery formats ~ 10 different incoming XML formats - Proprietary formats (majors) - Spotify delivery format (mostly indies) Tuesday, October 23, 12
  • 32. Ingestion: Delivery formats ~ 10 different incoming XML formats - Proprietary formats (majors) - Spotify delivery format (mostly indies) Thousands of lines of source specific code Tuesday, October 23, 12
  • 33. Data model [simplified] 1 Artist Transcoding * * * Album 1 1 * Disc 1 1 Audio * 1 * Track * Rights * Tuesday, October 23, 12
  • 34. Ingestion LXML and XSLT with extensions for parsing/transforming XML Tuesday, October 23, 12
  • 35. Ingestion: XPath extensions >>> def formerlify(_, name): ... return 'The artist formerly known as %s' %name >>> #Namespace stuff >>> from lxml import etree >>> ns = etree.FunctionNamespace('http://my.org/myfunctions') >>> ns['hello'] = hello >>> ns.prefix = 'f' >>> root = etree.XML('<a><b>Prince</b></a>') >>> print(root.xpath('f:hello(string(b))')) ... The artist formerly known as Prince http://lxml.de/extensions.html#xpath-extension-functions Tuesday, October 23, 12
  • 37. Ingestion Fun (?!) fact: largest XML file seen so far had 3.3 million rows taking up 350 MB of disk space Tuesday, October 23, 12
  • 38. Ingestion Fun (?!) fact: largest XML file seen so far had 3.3 million rows taking up 350 MB of disk space Bible apparently fits in 3MB XML Tuesday, October 23, 12
  • 39. Ingestion Fun (?!) fact: largest XML file seen so far had 3.3 million rows taking up 350 MB of disk space Bible apparently fits in 3MB XML >>> timeit.timeit('e.parse("huge.xml")', setup='import lxml.etree as e', number=5) / 5 4.19... >>> timeit.timeit('e.parse("huge.xml")', setup='import xml.etree.cElementTree as e', number=5) / 5 4.78... >>> timeit.timeit('e.parse("huge.xml")', setup='import xml.etree.ElementTree as e', number=5) / 5 55.39... Tuesday, October 23, 12
  • 40. Content pipeline Label A Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 41. Content pipeline ti on e s Label A n g I Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 42. Content pipeline ti on g e e s e r Label A n g M I Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 43. Centralized vs. aggregated cataloging Requ Requ ires h ires m uman ergin s! g! Tuesday, October 23, 12
  • 44. Metadata - challenges Image: Nicolas Genin (CC BY 2.0) http://www.flickr.com/photos/22785954@N08 Tuesday, October 23, 12
  • 45. Content pipeline Label A Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 46. Content pipeline ti on e s Label A n g I Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 47. Content pipeline ti on g e e s e r Label A n g M I Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 48. Content pipeline ti on g e e s e r Label A n g M I Label B Label C Label D Curation/enrichment Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 49. Ambiguous artists - thesis work Tuesday, October 23, 12
  • 50. Ambiguous artists - thesis work • User input Tuesday, October 23, 12
  • 51. Ambiguous artists - thesis work • User input • Machine learning Tuesday, October 23, 12
  • 52. Ambiguous artists - thesis work • User input • Machine learning • Matching against external sources Tuesday, October 23, 12
  • 53. Ambiguous artists - thesis work • User input • Machine learning • Matching against external sources • Feature selection (#matches per external source, len(name), country-count, multilingual) Tuesday, October 23, 12
  • 54. Ambiguous artists - thesis work • User input • Machine learning • Matching against external sources • Feature selection (#matches per external source, len(name), country-count, multilingual) • Matchings + preprocessing in Python Tuesday, October 23, 12
  • 55. Content matching (16 * 10 ** 6) ** 2 Tuesday, October 23, 12
  • 56. Content matching (16 * 10 ** 6) ** 2 = A large number Tuesday, October 23, 12
  • 57. Content matching (16 * 10 ** 6) ** 2 = A large number Reduce search space: >>> from unicodedata import normalize >>> key = ''.join(normalize('NFD', char)[0].lower() for char in title)[5] Tuesday, October 23, 12
  • 58. Content matching (16 * 10 ** 6) ** 2 = A large number Reduce search space: >>> from unicodedata import normalize >>> key = ''.join(normalize('NFD', char)[0].lower() for char in title)[5] Side note: Levenshtein (edit) distance is a heavy operation -> speeded up about 4x with pypy (or use c-extension) Tuesday, October 23, 12
  • 59. Automatic data processing will never be perfect Tuesday, October 23, 12
  • 60. it! h Automatic data processing will never be perfect c a t P Tuesday, October 23, 12
  • 61. Content pipeline Label A Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 62. Content pipeline ti on e s Label A n g I Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 63. Content pipeline ti on g e e s e r Label A n g M I Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 64. Content pipeline ti on g e e s e r Label A n g M I Label B Label C Label D Curation/enrichment Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 65. Content pipeline ti on g e e s e r Label A n g M I Label B Label C Label D Curation/enrichment g in od n sc Tra Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 66. Transcoding Asynchronous RabbitMQ + amqplib Master / workers Tuesday, October 23, 12
  • 67. Content pipeline Label A Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 68. Content pipeline ti on e s Label A n g I Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 69. Content pipeline ti on g e e s e r Label A n g M I Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 70. Content pipeline ti on g e e s e r Label A n g M I Label B Label C Label D Curation/enrichment Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 71. Content pipeline ti on g e e s e r Label A n g M I Label B Label C Label D Curation/enrichment g in od n sc Tra Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 72. Content pipeline ti on e n g s e r g e xi g e d Label A In M In Label B Label C Label D Curation/enrichment g in od n sc Tra Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 74. Index build • Nightly batch job on db-dumps Tuesday, October 23, 12
  • 75. Index build • Nightly batch job on db-dumps • Previously mostly python but now moved to Java for performance reason Tuesday, October 23, 12
  • 76. Index build • Nightly batch job on db-dumps • Previously mostly python but now moved to Java for performance reason • But still lots of python helper scripts :) Tuesday, October 23, 12
  • 77. Content pipeline Label A Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 78. Content pipeline ti on e s Label A n g I Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 79. Content pipeline ti on g e e s e r Label A n g M I Label B Label C Label D Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 80. Content pipeline ti on g e e s e r Label A n g M I Label B Label C Label D Curation/enrichment Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 81. Content pipeline ti on g e e s e r Label A n g M I Label B Label C Label D Curation/enrichment g in od n sc Tra Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 82. Content pipeline ti on e n g s e r g e xi g e d Label A In M In Label B Label C Label D Curation/enrichment g in od n sc Tra Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 83. Content pipeline g on e n g in s ti r g xi l is h e e de b Label A n g M In u I P Label B Label C Label D Curation/enrichment g On site live services, in od e.g. search, browse n sc Tra Image: Steve Juvertson (CC BY 2.0) http://www.flickr.com/photos/jurvetson/916142/ Tuesday, October 23, 12
  • 84. Distribution/publish Service A Service B Service C Tuesday, October 23, 12
  • 85. Distribution/publish Service A Index A Service B Index B Index C Service C Tuesday, October 23, 12
  • 86. Distribution/publish Service A Index A Service B Index B Index C Service C Tuesday, October 23, 12
  • 87. Distribution/publish Service A Index A Service B Index B Index C Service C Tuesday, October 23, 12
  • 88. Distribution/publish Service A Index A Service B Index B Index C Service C Tuesday, October 23, 12
  • 89. Distribution/publish Service A Index A Service B Index B Index C Service C Tuesday, October 23, 12
  • 90. Scheduling being migrated to ZooKeeper image: http://www.flickr.com/photos/seattlemunicipalarchives/with/3797940791/
  • 91. Distribution/publish Staged rollout
  • 93. Distribution/publish Exponential back-off
  • 94. Distribution/publish Exponential back-off waiting 5s ...
  • 95. Distribution/publish Exponential back-off waiting 5s ... waiting 10s ...
  • 96. Distribution/publish Exponential back-off waiting 5s ... waiting 10s ... waiting 30s ...
  • 97. Distribution/publish Exponential back-off waiting 5s ... waiting 10s ... waiting 30s ... waiting 60s ...
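The back-off idea on slides 93 to 97 can be sketched as a small retry helper. This is a minimal illustration and not the speaker's code: the name `retry_with_backoff`, its parameters, and the choice to retry on any exception are assumptions. Only the wait schedule (5s, 10s, 30s, 60s) comes from the slides.

```python
import time


def retry_with_backoff(operation, delays=(5, 10, 30, 60), sleep=time.sleep):
    """Call operation() until it succeeds, waiting longer after each failure.

    `delays` mirrors the waits shown on the slides. `sleep` is injectable
    so the helper can be exercised without actually waiting.
    """
    for delay in delays:
        try:
            return operation()
        except Exception:
            # Assumption: every failure is treated as retryable.
            print("waiting %ds ..." % delay)
            sleep(delay)
    # Final attempt after the last wait; a failure here propagates to the caller.
    return operation()
```

A caller would pass the publish step as `operation`, for example `retry_with_backoff(lambda: push_index(host))`, where `push_index` stands in for whatever function performs the distribution.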
  • 98. Content pipeline: Label A, Label B, Label C, Label D. Stages: Ingestion, Merge, Indexing, Publishing, Curation/enrichment, Transcoding. On-site live services, e.g. search, browse
  • 99. Store ’da data
  • 100. Choice of database
  • 101. Choice of database Depends on the use case - duh!
  • 102. Choice of database Depends on the use case - duh! • PostgreSQL (e.g. user service)
  • 103. Choice of database Depends on the use case - duh! • PostgreSQL (e.g. user service) • Cassandra (e.g. playlist service)
  • 104. Choice of database Depends on the use case - duh! • PostgreSQL (e.g. user service) • Cassandra (e.g. playlist service) • Tokyo Cabinet (e.g. browse service)
  • 105. Choice of database Depends on the use case - duh! • PostgreSQL (e.g. user service) • Cassandra (e.g. playlist service) • Tokyo Cabinet (e.g. browse service) • Lucene (search service)
  • 106. Choice of database Depends on the use case - duh! • PostgreSQL (e.g. user service) • Cassandra (e.g. playlist service) • Tokyo Cabinet (e.g. browse service) • Lucene (search service) • HDFS
  • 107. PostgreSQL [Pic. of elephant] Image: http2007 (CC BY 2.0) http://www.flickr.com/photos/42424413@N06/5064658450/
  • 108. PostgreSQL Redundancy + scaling: master/slave
  • 109. PostgreSQL Joins and subqueries - let the query planner roll!
  • 110. PostgreSQL Python?
  • 111. PostgreSQL Python? - psycopg2 + SQL-queries - SQLAlchemy migrator for versioning of db-schemas
  • 112. PostgreSQL Python? - psycopg2 + SQL-queries - SQLAlchemy migrator for versioning of db-schemas. Tip! Server side, aka named, cursors:
        import psycopg2

        conn = psycopg2.connect(database='huge_db', user='postgres', password='secret')
        # Passing a name to cursor() makes it a server-side (named) cursor
        sscursor = conn.cursor('my_cursor')
        sscursor.execute('SELECT * FROM big_table')
        # Rows are fetched from the server in batches instead of all at once
        rows = sscursor.fetchmany(1000)
        ...
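The snippet on slide 112 fetches a single batch and then trails off. One way the remaining batches might be consumed is a small generator like the sketch below. The helper name `iter_rows` is hypothetical and not from the talk. It relies only on the DB-API `fetchmany(size)` call, which returns an empty sequence once the result set is exhausted.

```python
def iter_rows(cursor, batch_size=1000):
    """Yield rows one at a time while fetching them in batches.

    Works with any DB-API cursor; with a psycopg2 named cursor only
    about `batch_size` rows are held in client memory at once.
    """
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            # An empty batch means the result set is exhausted.
            break
        for row in rows:
            yield row
```

With the cursor from the slide this would read `for row in iter_rows(sscursor): ...` after the `execute` call.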
  • 113. Scaling the content pipeline What to scale for?
  • 114. Scaling the content pipeline Size of catalog
  • 115. Scaling the content pipeline # Users
  • 116. Thank you henok@spotify.com
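  • 117. Distribution/publish Popen + gevent (although IO-bound)
        from gevent import monkey
        monkey.patch_all()

        import subprocess
        import gevent

        def _wait(self):
            # Poll instead of blocking so other greenlets can run meanwhile
            while True:
                res = self.poll()
                if res is not None:
                    return res
                gevent.sleep(0.1)

        # Replace the blocking wait() with the cooperative version
        subprocess.Popen.wait = _wait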