The ContentMine Scraping Stack: how we'll harvest 100,000,000 facts from the scientific literature

The ContentMine project (http://contentmine.org) will harvest 100 million facts from the literature. Here we summarise the technology stack we're building to enable the first step: collecting the literature. This presentation was given with a paper (https://github.com/Blahah/scraperJSON-demo-paper) at WOSP 2014.

The ContentMine
Scraping Stack
Richard Smith-Unna, Peter Murray-Rust
University of Cambridge

“make 100,000,000 facts from the scholarly
literature open, accessible and reusable”
our mission

The scale of the task
• ~ 27,000 peer reviewed journals (Ulrich's)
• > 5,000 publishers
• new papers every day

The pipeline
[diagram slide; the image is not captured in this text]

scraperJSON
• scrapers all have the same plumbing
• ignore the plumbing, just configure
• benefits
  • supports large collections of scrapers
  • no programming required
  • not limited to one piece of software
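
To illustrate how a large collection of declarative scrapers might be used, here is a minimal Python sketch that picks the right definition for a given article URL. It assumes that the "url" value of a definition is treated as a regular expression. The two definitions, their patterns, the PeerJ selector, and the helper name find_scraper are hypothetical and are not taken from the ContentMine code.

```python
import re

# Hypothetical collection of scraperJSON-style definitions held as dicts.
# Patterns and selectors are illustrative, not the real journal-scrapers.
SCRAPERS = [
    {"name": "PLOS", "url": r"plos\w*\.org",
     "elements": {"title": {"selector": "//h1[@property='dc:title']"}}},
    {"name": "PeerJ", "url": r"peerj\.com",
     "elements": {"title": {"selector": "//h1[@class='article-title']"}}},
]


def find_scraper(url, scrapers):
    """Return the first definition whose "url" regex matches, else None."""
    for scraper in scrapers:
        if re.search(scraper["url"], url):
            return scraper
    return None


match = find_scraper("http://www.plosone.org/article/example", SCRAPERS)
print(match["name"] if match else "no scraper found")
```

Because the definitions are plain data, supporting another publisher in this sketch means appending one more dict rather than writing new code.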

Basic scraperJSON
{
  "name": "PLOS",
  "url": "plosw*.org",
  "elements": {
    "title": {
      "selector": "//h1[@property='dc:title']"
    }
  }
}
• "name": name of the scraper
• "url": the URL(s) it applies to
• "elements": the elements to capture
• "title": element name
• "selector": where to find it
http://github.com/ContentMine/scraperJSON
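
Since a scraper is only configuration, its shape can be checked before it is used. The Python sketch below tests for the three top-level keys shown on this slide and for a "selector" on each element. The function name validate_scraper and the wording of the messages are assumptions made for illustration. The scraperJSON repository remains the authority on what a valid definition is.

```python
import json


def validate_scraper(definition):
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    for key in ("name", "url", "elements"):
        if key not in definition:
            problems.append(f"missing key: {key}")
    elements = definition.get("elements")
    if isinstance(elements, dict):
        # Every element to capture needs to say where to find it.
        for element_name, spec in elements.items():
            if not isinstance(spec, dict) or "selector" not in spec:
                problems.append(f"element {element_name!r} has no selector")
    elif elements is not None:
        problems.append('"elements" must be an object')
    return problems


# The definition from the slide, written as strict JSON.
raw = """
{
  "name": "PLOS",
  "url": "plosw*.org",
  "elements": {
    "title": {"selector": "//h1[@property='dc:title']"}
  }
}
"""
print(validate_scraper(json.loads(raw)))
```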

bibJSON output
{
  "title": "Ab Initio Identification of Novel Regulatory Elements in the Genome of Trypanosoma brucei by Bayesian Inference on Sequence Segmentation"
}
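
The step from a scraper definition to output like the above can be sketched in a few lines of Python. This is not how thresher works internally. It uses only the standard library, so the page must be well-formed XML, and the leading "//" of the selector is made relative because ElementTree rejects absolute paths. The sample page, its title, and the function name scrape are invented for the example.

```python
import json
import xml.etree.ElementTree as ET

SCRAPER = {
    "name": "PLOS",
    "elements": {"title": {"selector": "//h1[@property='dc:title']"}},
}

# Stand-in for a fetched article page (hypothetical content, well-formed XML).
PAGE = """<html><body>
<h1 property="dc:title">An Example Article Title</h1>
</body></html>"""


def scrape(page_source, scraper):
    """Apply each element's selector and collect the first match as text."""
    root = ET.fromstring(page_source)
    record = {}
    for name, spec in scraper["elements"].items():
        # ElementTree only accepts relative paths: "//h1" becomes ".//h1".
        nodes = root.findall("." + spec["selector"])
        if nodes:
            record[name] = "".join(nodes[0].itertext()).strip()
    return record


print(json.dumps(scrape(PAGE, SCRAPER), indent=2))
```

Real article pages are HTML rather than strict XML, so a practical implementation needs an HTML-aware parser and a full XPath engine.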

thresher & quickscrape
• reference implementation of scraperJSON
• thresher is the scraping library
• http://github.com/ContentMine/thresher
• quickscrape is the command-line tool
• http://github.com/ContentMine/quickscrape
• Node.js, MIT licensed

journal-scrapers
http://github.com/ContentMine/journal-scrapers
a self-testing collection of scraperJSON scrapers for academic journals
• PLOS
• MDPI
• PeerJ
• Wiley
• ScienceDirect
• Springer
• Taylor & Francis
• NPG, AAAS, RSC, ACS, …

Future work
• GUI (browser plugin) for creating scrapers
• Standalone GUI for scraping

Acknowledgements
• Peter Murray-Rust
• Michelle Brook
• Mark MacGillivray
• Emanuil Tolev
• Ross Mounce
• Jenny Molloy
http://contentmine.org
http://github.com/ContentMine
• Our volunteer community and collaborators
• Funding: Shuttleworth Foundation
