Avoid Common Failures When Analyzing Wikipedia Data


José Felipe Ortega


Common failures to be avoided
when we analyze Wikipedia
public data

Felipe Ortega
GSyC/Libresoft

Wikimania 2010, Gdańsk, Poland.

#1 Beware of special types of pages
● The official article count includes pages with as little as a single link.
● Consider whether you need to filter out disambiguation pages.
● Pay attention to redirects.
● People sometimes wonder why the number of pages in the main namespace of the dump is so high.
● Break down evolution trends by namespace.
● Articles are very different from other pages.
● Explore the percentage of articles that already have a talk page.
● Connections from user pages.
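
As an illustration of the points above, here is a minimal sketch that counts pages per namespace and redirects in the main namespace while streaming an XML dump. It assumes an export schema where each <page> has <ns> and (for redirects) <redirect> child elements; older dumps may lack these, so check your own files first. The sample data and the names count_pages and SAMPLE are made up for this example.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny hand-made stand-in for a dump (assumption: <ns> and <redirect> exist).
SAMPLE = b"""<mediawiki>
  <page><title>Gdansk</title><ns>0</ns>
    <revision><text>[[Poland]]</text></revision></page>
  <page><title>Danzig</title><ns>0</ns><redirect title="Gdansk" />
    <revision><text>#REDIRECT [[Gdansk]]</text></revision></page>
  <page><title>Talk:Gdansk</title><ns>1</ns>
    <revision><text>Discussion</text></revision></page>
</mediawiki>"""


def local(tag):
    """Drop the '{namespace-uri}' prefix that real dumps put on every tag."""
    return tag.rsplit("}", 1)[-1]


def count_pages(stream):
    """Return (pages per namespace, number of redirects in namespace 0)."""
    by_ns = Counter()
    redirects_main = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        ns = None
        is_redirect = False
        for child in elem:
            name = local(child.tag)
            if name == "ns":
                ns = int(child.text)
            elif name == "redirect":
                is_redirect = True
        by_ns[ns] += 1
        if is_redirect and ns == 0:
            redirects_main += 1
        elem.clear()  # release the finished page's children to limit memory use
    return by_ns, redirects_main


by_ns, redirects_main = count_pages(io.BytesIO(SAMPLE))
print(dict(by_ns), redirects_main)
```

Reporting redirects separately from the namespace totals makes it easier to see how much of the main namespace is not real article content.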

#2 Plan your hardware carefully
● There are some general rules.
● Parallelize as much as possible.
● Buy more memory before buying more disk...
● But take a look at your disk requirements.
– Requirements are very different when you can decompress the data on the fly.
● Hardware RAID is not always the best solution.
– RAID 10 in Linux can perform decently in many
average studies.
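
To make the disk point concrete, here is a minimal sketch of reading a bzip2-compressed file line by line without ever writing the decompressed data to disk. It uses only the Python standard library; the function name count_lines_matching and the in-memory sample are hypothetical, and a real dump would be opened by path instead.

```python
import bz2
import io


def count_lines_matching(compressed_stream, needle):
    """Count lines containing `needle`, decompressing on the fly."""
    hits = 0
    # bz2.open accepts a path or an already opened binary file object.
    with bz2.open(compressed_stream, mode="rb") as fh:
        for line in fh:
            if needle in line:
                hits += 1
    return hits


# Stand-in for a compressed dump, built in memory for the example.
raw = b"<page>\n<title>A</title>\n</page>\n<page>\n<title>B</title>\n</page>\n"
compressed = io.BytesIO(bz2.compress(raw))

n_pages = count_lines_matching(compressed, b"<page>")
print(n_pages)
```

The trade-off is CPU time spent on decompression at every run in exchange for much lower disk usage.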

#3 Know your engines (DBs)
● Correct configuration of the DB engine is crucial.
● You'll always fall short with standard configs.
● Fine tune parameters according to your hardware.
● Exploit memory as much as possible.
– E.g. MEMORY engine in MySQL.
● Avoid unnecessary backup...
● But be sure that you have copies of relevant info elsewhere!
● Think about your process:
– Read only vs. read-write.
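
The idea of keeping working data in memory can be sketched with SQLite's in-memory database from the Python standard library. This is a stand-in chosen so the example runs anywhere; it is not the MySQL MEMORY engine mentioned above, and the table and column names are invented for illustration.

```python
import sqlite3

# ":memory:" keeps the whole database in RAM; nothing is written to disk,
# so anything you need to keep must be copied elsewhere before closing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revision (page_id INTEGER, user TEXT)")

# Hypothetical sample rows: (page id, contributor).
rows = [(1, "alice"), (1, "bob"), (2, "alice")]
conn.executemany("INSERT INTO revision VALUES (?, ?)", rows)

# Aggregate entirely in memory: number of revisions per page.
revs_per_page = dict(
    conn.execute("SELECT page_id, COUNT(*) FROM revision GROUP BY page_id")
)
conn.close()
print(revs_per_page)
```

An in-memory store like this suits read-mostly intermediate tables that can be rebuilt from the dump if the process dies.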

#4 Organize your code
● Using an SCM is a must.
● SVN, Git.
● Upload your code to a public repository.
● BerliOS, SourceForge, GitHub...
● Document your code...
● ...if you ever aspire to get interest from other
developers.
● Use consistent version numbers.
● Test, test, test...
● Include sanity checks and “toy tests”.
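
A "toy test" can be as small as a few hand-written inputs whose correct answer is obvious. The sketch below checks a hypothetical helper, redirect_target, against three such inputs. It assumes the English "#REDIRECT" keyword only; other language editions use localized keywords, which this example does not handle.

```python
import re

# Assumption: English redirect syntax, e.g. "#REDIRECT [[Target]]".
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_target(wikitext):
    """Return the redirect target title, or None if the page is no redirect."""
    match = REDIRECT_RE.match(wikitext)
    return match.group(1).strip() if match else None


def run_toy_tests():
    """Tiny inputs with known answers; run these before any long job."""
    assert redirect_target("#REDIRECT [[Gdansk]]") == "Gdansk"
    assert redirect_target("#redirect[[Gdansk#History]]") == "Gdansk"
    assert redirect_target("Gdansk is a city in [[Poland]].") is None
    return True


toy_tests_passed = run_toy_tests()
print(toy_tests_passed)
```

Running such checks at the start of every long job costs milliseconds and can save hours of processing with a broken parser.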

#5 Use the right “spell”
● Target data is well defined:
● XML
● Big portions of plain text
● Inter-wiki links and outlinks.
● Some alternatives
● cElementTree (high-speed parsing)
● Python (modules/short scripts) or Java (big
projects).
● Perl (regexps).
● Sed & awk
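
As a small example of the regexp approach, the sketch below pulls internal link targets out of a piece of wikitext. The pattern is a simplification written for this example: it ignores nested links (such as those inside image captions) and templates, so treat it as a starting point rather than a complete parser.

```python
import re

# [[Target]], [[Target|label]] and [[Target#Section]] all yield "Target".
LINK_RE = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def link_targets(wikitext):
    """Return the target titles of the internal links found in `wikitext`."""
    return [target.strip() for target in LINK_RE.findall(wikitext)]


sample = "See [[Poland]], [[Gdańsk|the city]] and [[Wikipedia:About#History]]."
targets = link_targets(sample)
print(targets)
```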

#6 Avoid reinventing the wheel
● Consider developing your own tool only if:
● No available solution fits your needs.
– Or you can only find proprietary/evaluation software.
● Performance of other solutions is really bad.
● Example: pywikipediabot
● Simple library to query the Wikipedia API.
● Solves many simple needs of
researchers/programmers.
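
For context, this is the kind of request such a library wraps. The sketch below only builds the URL of a MediaWiki API query for site statistics using the Python standard library; it does not use pywikipediabot and sends no request. The endpoint shown is the English Wikipedia one and the function name is made up for this example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: the standard api.php endpoint of the English Wikipedia.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def siteinfo_statistics_url(endpoint=API_ENDPOINT):
    """Build the URL of a query asking for general site statistics."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    }
    return endpoint + "?" + urlencode(params)


url = siteinfo_statistics_url()
query = parse_qs(urlsplit(url).query)
print(url)
```

An existing client library adds what this sketch leaves out, such as throttling, continuation of long result sets, and error handling.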

#7 Automate everything

● Huge data repositories.
● Even small samples are excessively time-consuming if processed by hand.
● You will start chaining individual processes together.
● You will save time for later executions.
● Your study will be reproducible.
● Updating results after several months becomes a no-brainer.
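
Chaining individual processes can start as simply as a list of functions applied in order. The sketch below is one possible shape, with hypothetical step names and sample records; the log of executed steps is what makes a later rerun easy to compare.

```python
def run_pipeline(data, steps):
    """Apply each step to the output of the previous one, logging step names."""
    log = []
    for step in steps:
        data = step(data)
        log.append(step.__name__)
    return data, log


def keep_main_namespace(pages):
    return [p for p in pages if p["ns"] == 0]


def drop_redirects(pages):
    return [p for p in pages if not p["redirect"]]


def count_pages(pages):
    return len(pages)


# Hypothetical records standing in for parsed dump entries.
pages = [
    {"title": "Gdansk", "ns": 0, "redirect": False},
    {"title": "Danzig", "ns": 0, "redirect": True},
    {"title": "Talk:Gdansk", "ns": 1, "redirect": False},
]

result, log = run_pipeline(pages, [keep_main_namespace, drop_redirects, count_pages])
print(result, log)
```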

#8 Extreme case of Murphy's Law

● Always expect the worst possible case.
● Many caveats in each implementation.
● Countless particular cases.
● The “average solution” is not good enough.
– Standard algorithms may take much longer than expected to finish the job.

#9 Not many graphical interfaces

● Some good reasons for that
● Difficult to automate
● Hard to display dynamic results in real-time.

● Almost impossible to compute all results in a reasonable time frame for huge data collections (e.g. English Wikipedia).
● To the best of my knowledge, there are very
few tools with graphical interfaces out there.
● Is there a real need for that??

#10 Communication channels

● Wikimedia-research-l
● Mailing list about research on Wikimedia
projects.
● http://meta.wikimedia.org/wiki/Research
● http://meta.wikimedia.org/wiki/Wikimedia_Research_Network
● http://acawiki.org/Home
● Final comments
● Need for a consolidated info point, once and for all.
