SlideShare a Scribd company logo
1 of 25
Download to read offline
Importing Wikipedia in Plone
Eric BREHAULT – Plone Conference 2013
ZODB is good at storing objects
● Plone contents are objects,
● we store them in the ZODB,
● everything is fine, end of the story.
But what if ...
... we want to store non-contentish records?
Like polls, statistics, mail-list subscribers, etc.,
or any business-specific structured data.
Store them as contents anyway
That is a powerfull solution.
But there are 2 major problems...
Problem 1: You need to manage a secondary
system
● you need to deploy it,
● you need to backup it,
● you need to secure it,
● etc.
Problem 2: I hate SQL
No explanation here.
I think I just cannot digest it...
How to store many records in the ZODB?
● Is the ZODB strong enough?
● Is the ZCatalog strong enough?
My grandmother often told me
"If you want to become stronger, you have to eat your soup."
Where do we find a good soup for Plone?
In a super souper!!!
souper.plone and souper
● It provides both storage and indexing.
● Record can store any persistent pickable data.
● Created by BlueDynamics.
● Based on ZODB BTrees, node.ext.zodb, and repoze.catalog.
Add a record
>>> soup = get_soup('mysoup', context)
>>> record = Record()
>>> record.attrs['user'] = 'user1'
>>> record.attrs['text'] = u'foo bar baz'
>>> record.attrs['keywords'] = [u'1', u'2', u'ü']
>>> record_id = soup.add(record)
Record in record
>>> record['homeaddress'] = Record()
>>> record['homeaddress'].attrs['zip'] = '6020'
>>> record['homeaddress'].attrs['town'] = 'Innsbruck'
>>> record['homeaddress'].attrs['country'] = 'Austria'
Access record
>>> from souper.soup import get_soup
>>> soup = get_soup('mysoup', context)
>>> record = soup.get(record_id)
Query
>>> from repoze.catalog.query import Eq, Contains
>>> [r for r in soup.query(Eq('user', 'user1')
& Contains('text', 'foo'))]
[<Record object 'None' at ...>]
or using CQE format
>>> [r for r in soup.query("user == 'user1' and 'foo' in text")]
[<Record object 'None' at ...>]
souper
● a Soup-container can be moved to a specific ZODB mount-
point,
● it can be shared across multiple independent Plone instances,
● souper works on Plone and Pyramid.
Plomino & souper
● we use Plomino to build non-content oriented apps easily,
● we use souper to store huge amount of application data.
Plomino data storage
Originally, documents (=record) were ATFolder.
Capacity about 30 000.
Plomino data storage
Since 1.14, documents are pure CMF.
Capacity about 100 000.
Usally the Plomino ZCatalog contains a lot of indexes.
Plomino & souper
With souper, documents are just soup records.
Capacity: several millions.
Typical use case
● Store 500 000 addresses,
● Be able to query them in full text and display the result on a map.
Demo
What is the limit?
Can we import Wikipedia in souper?
Demo with 400 000 records
Demo with 5,5 millions of records
Conclusion
● Usage performances are good,
● Plone performances are not impacted.
Use it!
Thoughts
● What about a REST API on top of it?
● Massive import is long and difficult, could it be improved?
Makina Corpus
For all questions related to this talk,
please contact Éric Bréhault
eric.brehault@makina-corpus.com
Tel : +33 534 566 958
www.makina-corpus.com

More Related Content

Similar to Importing Wikipedia in Plone

Centralized logging system using mongoDB
Centralized logging system using mongoDBCentralized logging system using mongoDB
Centralized logging system using mongoDBVivek Parihar
 
Async IO and Multithreading explained
Async IO and Multithreading explainedAsync IO and Multithreading explained
Async IO and Multithreading explainedDirecti Group
 
Buildout: creating and deploying repeatable applications in python
Buildout: creating and deploying repeatable applications in pythonBuildout: creating and deploying repeatable applications in python
Buildout: creating and deploying repeatable applications in pythonCodeSyntax
 
python-160403194316.pdf
python-160403194316.pdfpython-160403194316.pdf
python-160403194316.pdfgmadhu8
 
MongoDB and AWS Best Practices
MongoDB and AWS Best PracticesMongoDB and AWS Best Practices
MongoDB and AWS Best PracticesMongoDB
 
Python Seminar PPT
Python Seminar PPTPython Seminar PPT
Python Seminar PPTShivam Gupta
 
Running a Plone product on Substance D
Running a Plone product on Substance DRunning a Plone product on Substance D
Running a Plone product on Substance DMakina Corpus
 
A novel approach to Undo
A novel approach to UndoA novel approach to Undo
A novel approach to UndoSean Upton
 
Linux multiplexing
Linux multiplexingLinux multiplexing
Linux multiplexingMark Veltzer
 
GSoC2014 - PGCon2015 Presentation June, 2015
GSoC2014 - PGCon2015 Presentation June, 2015GSoC2014 - PGCon2015 Presentation June, 2015
GSoC2014 - PGCon2015 Presentation June, 2015Fabrízio Mello
 
Python_Introduction&DataType.pptx
Python_Introduction&DataType.pptxPython_Introduction&DataType.pptx
Python_Introduction&DataType.pptxHaythamBarakeh1
 
Prototype4Production Presented at FOSSASIA2015 at Singapore
Prototype4Production Presented at FOSSASIA2015 at SingaporePrototype4Production Presented at FOSSASIA2015 at Singapore
Prototype4Production Presented at FOSSASIA2015 at SingaporeDhruv Gohil
 
AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...
AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...
AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...GeeksLab Odessa
 
python presntation 2.pptx
python presntation 2.pptxpython presntation 2.pptx
python presntation 2.pptxArpittripathi45
 
Python and Pytorch tutorial and walkthrough
Python and Pytorch tutorial and walkthroughPython and Pytorch tutorial and walkthrough
Python and Pytorch tutorial and walkthroughgabriellekuruvilla
 

Similar to Importing Wikipedia in Plone (20)

Centralized logging system using mongoDB
Centralized logging system using mongoDBCentralized logging system using mongoDB
Centralized logging system using mongoDB
 
Async IO and Multithreading explained
Async IO and Multithreading explainedAsync IO and Multithreading explained
Async IO and Multithreading explained
 
Buildout: creating and deploying repeatable applications in python
Buildout: creating and deploying repeatable applications in pythonBuildout: creating and deploying repeatable applications in python
Buildout: creating and deploying repeatable applications in python
 
python-160403194316.pdf
python-160403194316.pdfpython-160403194316.pdf
python-160403194316.pdf
 
MongoDB and AWS Best Practices
MongoDB and AWS Best PracticesMongoDB and AWS Best Practices
MongoDB and AWS Best Practices
 
Data analysis with pandas
Data analysis with pandasData analysis with pandas
Data analysis with pandas
 
Data Analysis With Pandas
Data Analysis With PandasData Analysis With Pandas
Data Analysis With Pandas
 
Python
PythonPython
Python
 
Python Seminar PPT
Python Seminar PPTPython Seminar PPT
Python Seminar PPT
 
python into.pptx
python into.pptxpython into.pptx
python into.pptx
 
Running a Plone product on Substance D
Running a Plone product on Substance DRunning a Plone product on Substance D
Running a Plone product on Substance D
 
A novel approach to Undo
A novel approach to UndoA novel approach to Undo
A novel approach to Undo
 
Linux multiplexing
Linux multiplexingLinux multiplexing
Linux multiplexing
 
GSoC2014 - PGCon2015 Presentation June, 2015
GSoC2014 - PGCon2015 Presentation June, 2015GSoC2014 - PGCon2015 Presentation June, 2015
GSoC2014 - PGCon2015 Presentation June, 2015
 
Python_Introduction&DataType.pptx
Python_Introduction&DataType.pptxPython_Introduction&DataType.pptx
Python_Introduction&DataType.pptx
 
Prototype4Production Presented at FOSSASIA2015 at Singapore
Prototype4Production Presented at FOSSASIA2015 at SingaporePrototype4Production Presented at FOSSASIA2015 at Singapore
Prototype4Production Presented at FOSSASIA2015 at Singapore
 
AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...
AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...
AI&BigData Lab. Александр Конопко "Celos: оркестрирование и тестирование зада...
 
python presntation 2.pptx
python presntation 2.pptxpython presntation 2.pptx
python presntation 2.pptx
 
Python and Pytorch tutorial and walkthrough
Python and Pytorch tutorial and walkthroughPython and Pytorch tutorial and walkthrough
Python and Pytorch tutorial and walkthrough
 
Conf orm - explain
Conf orm - explainConf orm - explain
Conf orm - explain
 

More from Makina Corpus

Développer des applications mobiles avec phonegap
Développer des applications mobiles avec phonegapDévelopper des applications mobiles avec phonegap
Développer des applications mobiles avec phonegapMakina Corpus
 
Why CMS will not die
Why CMS will not dieWhy CMS will not die
Why CMS will not dieMakina Corpus
 
Team up Django and Web mapping - DjangoCon Europe 2014
Team up Django and Web mapping - DjangoCon Europe 2014Team up Django and Web mapping - DjangoCon Europe 2014
Team up Django and Web mapping - DjangoCon Europe 2014Makina Corpus
 
Petit déjeuner "Les bases de la cartographie sur le Web"
Petit déjeuner "Les bases de la cartographie sur le Web"Petit déjeuner "Les bases de la cartographie sur le Web"
Petit déjeuner "Les bases de la cartographie sur le Web"Makina Corpus
 
Petit déjeuner "Développer sur le cloud, ou comment tout construire à partir ...
Petit déjeuner "Développer sur le cloud, ou comment tout construire à partir ...Petit déjeuner "Développer sur le cloud, ou comment tout construire à partir ...
Petit déjeuner "Développer sur le cloud, ou comment tout construire à partir ...Makina Corpus
 
CoDe, le programme de développement d'applications mobiles de Makina Corpus
CoDe, le programme de développement d'applications mobiles de Makina Corpus CoDe, le programme de développement d'applications mobiles de Makina Corpus
CoDe, le programme de développement d'applications mobiles de Makina Corpus Makina Corpus
 
Petit déjeuner "Alternatives libres à GoogleMaps" du 11 février 2014 - Nantes...
Petit déjeuner "Alternatives libres à GoogleMaps" du 11 février 2014 - Nantes...Petit déjeuner "Alternatives libres à GoogleMaps" du 11 février 2014 - Nantes...
Petit déjeuner "Alternatives libres à GoogleMaps" du 11 février 2014 - Nantes...Makina Corpus
 
Petit déjeuner "Les nouveautés de la cartographie en ligne" du 12 décembre
Petit déjeuner "Les nouveautés de la cartographie en ligne" du 12 décembrePetit déjeuner "Les nouveautés de la cartographie en ligne" du 12 décembre
Petit déjeuner "Les nouveautés de la cartographie en ligne" du 12 décembreMakina Corpus
 
Alternatives libres à Google Maps
Alternatives libres à Google MapsAlternatives libres à Google Maps
Alternatives libres à Google MapsMakina Corpus
 
Atelier "Les nouveautés de la cartographie en ligne"
Atelier "Les nouveautés de la cartographie en ligne"Atelier "Les nouveautés de la cartographie en ligne"
Atelier "Les nouveautés de la cartographie en ligne"Makina Corpus
 
Petit Déjeuner : HTML5 et CSS3, les interfaces de demain.
Petit Déjeuner : HTML5 et CSS3, les interfaces de demain.Petit Déjeuner : HTML5 et CSS3, les interfaces de demain.
Petit Déjeuner : HTML5 et CSS3, les interfaces de demain.Makina Corpus
 
Des cartes d'un autre monde - DjangoCong 2012
Des cartes d'un autre monde - DjangoCong 2012Des cartes d'un autre monde - DjangoCong 2012
Des cartes d'un autre monde - DjangoCong 2012Makina Corpus
 
Solutions libres alternatives à Google Maps
Solutions libres alternatives à Google MapsSolutions libres alternatives à Google Maps
Solutions libres alternatives à Google MapsMakina Corpus
 

More from Makina Corpus (15)

Développer des applications mobiles avec phonegap
Développer des applications mobiles avec phonegapDévelopper des applications mobiles avec phonegap
Développer des applications mobiles avec phonegap
 
Why CMS will not die
Why CMS will not dieWhy CMS will not die
Why CMS will not die
 
Team up Django and Web mapping - DjangoCon Europe 2014
Team up Django and Web mapping - DjangoCon Europe 2014Team up Django and Web mapping - DjangoCon Europe 2014
Team up Django and Web mapping - DjangoCon Europe 2014
 
Petit déjeuner "Les bases de la cartographie sur le Web"
Petit déjeuner "Les bases de la cartographie sur le Web"Petit déjeuner "Les bases de la cartographie sur le Web"
Petit déjeuner "Les bases de la cartographie sur le Web"
 
Petit déjeuner "Développer sur le cloud, ou comment tout construire à partir ...
Petit déjeuner "Développer sur le cloud, ou comment tout construire à partir ...Petit déjeuner "Développer sur le cloud, ou comment tout construire à partir ...
Petit déjeuner "Développer sur le cloud, ou comment tout construire à partir ...
 
CoDe, le programme de développement d'applications mobiles de Makina Corpus
CoDe, le programme de développement d'applications mobiles de Makina Corpus CoDe, le programme de développement d'applications mobiles de Makina Corpus
CoDe, le programme de développement d'applications mobiles de Makina Corpus
 
Petit déjeuner "Alternatives libres à GoogleMaps" du 11 février 2014 - Nantes...
Petit déjeuner "Alternatives libres à GoogleMaps" du 11 février 2014 - Nantes...Petit déjeuner "Alternatives libres à GoogleMaps" du 11 février 2014 - Nantes...
Petit déjeuner "Alternatives libres à GoogleMaps" du 11 février 2014 - Nantes...
 
Petit déjeuner "Les nouveautés de la cartographie en ligne" du 12 décembre
Petit déjeuner "Les nouveautés de la cartographie en ligne" du 12 décembrePetit déjeuner "Les nouveautés de la cartographie en ligne" du 12 décembre
Petit déjeuner "Les nouveautés de la cartographie en ligne" du 12 décembre
 
Alternatives libres à Google Maps
Alternatives libres à Google MapsAlternatives libres à Google Maps
Alternatives libres à Google Maps
 
Atelier "Les nouveautés de la cartographie en ligne"
Atelier "Les nouveautés de la cartographie en ligne"Atelier "Les nouveautés de la cartographie en ligne"
Atelier "Les nouveautés de la cartographie en ligne"
 
Petit Déjeuner : HTML5 et CSS3, les interfaces de demain.
Petit Déjeuner : HTML5 et CSS3, les interfaces de demain.Petit Déjeuner : HTML5 et CSS3, les interfaces de demain.
Petit Déjeuner : HTML5 et CSS3, les interfaces de demain.
 
Geotrek
GeotrekGeotrek
Geotrek
 
Plomino
Plomino Plomino
Plomino
 
Des cartes d'un autre monde - DjangoCong 2012
Des cartes d'un autre monde - DjangoCong 2012Des cartes d'un autre monde - DjangoCong 2012
Des cartes d'un autre monde - DjangoCong 2012
 
Solutions libres alternatives à Google Maps
Solutions libres alternatives à Google MapsSolutions libres alternatives à Google Maps
Solutions libres alternatives à Google Maps
 

Recently uploaded

Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 

Recently uploaded (20)

Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Importing Wikipedia in Plone

  • 1. Importing Wikipedia in Plone Eric BREHAULT – Plone Conference 2013
  • 2. ZODB is good at storing objects ● Plone contents are objects, ● we store them in the ZODB, ● everything is fine, end of the story.
  • 3. But what if ... ... we want to store non-contentish records? Like polls, statistics, mail-list subscribers, etc., or any business-specific structured data.
  • 4. Store them as contents anyway That is a powerfull solution. But there are 2 major problems...
  • 5. Problem 1: You need to manage a secondary system ● you need to deploy it, ● you need to backup it, ● you need to secure it, ● etc.
  • 6. Problem 2: I hate SQL No explanation here.
  • 7. I think I just cannot digest it...
  • 8. How to store many records in the ZODB? ● Is the ZODB strong enough? ● Is the ZCatalog strong enough?
  • 9. My grandmother often told me "If you want to become stronger, you have to eat your soup."
  • 10. Where do we find a good soup for Plone? In a super souper!!!
  • 11. souper.plone and souper ● It provides both storage and indexing. ● Record can store any persistent pickable data. ● Created by BlueDynamics. ● Based on ZODB BTrees, node.ext.zodb, and repoze.catalog.
  • 12. Add a record >>> soup = get_soup('mysoup', context) >>> record = Record() >>> record.attrs['user'] = 'user1' >>> record.attrs['text'] = u'foo bar baz' >>> record.attrs['keywords'] = [u'1', u'2', u'ü'] >>> record_id = soup.add(record)
  • 13. Record in record >>> record['homeaddress'] = Record() >>> record['homeaddress'].attrs['zip'] = '6020' >>> record['homeaddress'].attrs['town'] = 'Innsbruck' >>> record['homeaddress'].attrs['country'] = 'Austria'
  • 14. Access record >>> from souper.soup import get_soup >>> soup = get_soup('mysoup', context) >>> record = soup.get(record_id)
  • 15. Query >>> from repoze.catalog.query import Eq, Contains >>> [r for r in soup.query(Eq('user', 'user1') & Contains('text', 'foo'))] [<Record object 'None' at ...>] or using CQE format >>> [r for r in soup.query("user == 'user1' and 'foo' in text")] [<Record object 'None' at ...>]
  • 16. souper ● a Soup-container can be moved to a specific ZODB mount- point, ● it can be shared across multiple independent Plone instances, ● souper works on Plone and Pyramid.
  • 17. Plomino & souper ● we use Plomino to build non-content oriented apps easily, ● we use souper to store huge amount of application data.
  • 18. Plomino data storage Originally, documents (=record) were ATFolder. Capacity about 30 000.
  • 19. Plomino data storage Since 1.14, documents are pure CMF. Capacity about 100 000. Usally the Plomino ZCatalog contains a lot of indexes.
  • 20. Plomino & souper With souper, documents are just soup records. Capacity: several millions.
  • 21. Typical use case ● Store 500 000 addresses, ● Be able to query them in full text and display the result on a map. Demo
  • 22. What is the limit? Can we import Wikipedia in souper? Demo with 400 000 records Demo with 5,5 millions of records
  • 23. Conclusion ● Usage performances are good, ● Plone performances are not impacted. Use it!
  • 24. Thoughts ● What about a REST API on top of it? ● Massive import is long and difficult, could it be improved?
  • 25. Makina Corpus For all questions related to this talk, please contact Éric Bréhault eric.brehault@makina-corpus.com Tel : +33 534 566 958 www.makina-corpus.com