SlideShare une entreprise Scribd logo
1  sur  30
Télécharger pour lire hors ligne
Facilitating Text and Data Mining
ALA Midwinter
26 Jan 2014
Chris Shillum

VP, Platform and Content
Elsevier A&G Research Markets
Outline
•
•
•
•
•

Our new policy
What is Text and Data Mining (TDM)?
Timeline of TDM at Elsevier
Practicalities
Outlook

2
Elsevier’s New Text and Data Mining Policy
Researchers at academic institutions can text mine
subscribed content on ScienceDirect for noncommercial purposes via the ScienceDirect APIs
Access is granted to faculty, researchers, staff and
students at the subscribing institution
Text mining output can be shared publically
under these conditions
1. May contain "snippets" of up to 200
characters of the original text
2. Should be licensed as CC-BY-NC
3. Should include DOI link to original content

http://www.elsevier.com/tdm

Corporate and other subscribers
• Your Elsevier Account Manager will be
happy to discuss options to meet your
needs
Open access content
• Text and Data mining permission are
determined by the author's choice of user
license. This information is detailed in the
individual articles
3
Policy Aligned with the Recent STM
Declaration on TDM

http://www.stm-assoc.org/2013_11_11_Text_and_Data_Mining_Declaration.pdf

4
Text and Data Mining – What is it?
Data mining: “the computational process of discovering patterns in
large data sets involving methods at the intersection of artificial
intelligence, machine learning, statistics, and database systems”
Text mining: essentially a subset/“input step” to data mining: turning
text into structured data for analysis, via application of natural language
processing (NLP) and analytical methods.

Typical text mining tasks include text categorization, text
clustering, concept/entity extraction, production of granular
taxonomies, sentiment analysis, document summarization, and entity
relation modeling (i.e., learning relations between named entities).

http://en.wikipedia.org/wiki/Data_mining

5
TDM Examples – Entity Tagging and Linking

http://dx.doi.org/10.1016/j.bbalip.2012.02.009

6
TDM Examples – Information Extraction

http://genome.ucsc.edu/
7
TDM Examples – Information extraction

http://neuroelectro.org
8
TDM Examples – Natural Language Modelling

http://www.elsevier.com/online-tools/pathway-studio/

9

9
Text mining is Becoming an Essential Tool
Blue Blain project
• Reconstructing the brain piece by
piece and building a virtual brain
in a supercomputer
• Received €500M (~$685m) of EU
funding
“The success of the Blue Brain project depends on very high volumes of standardized,
high quality experimental data covering all possible levels of brain organization. Data
comes both from the literature (via the project’s automatic information extraction
tools) and from experimental work conducted by the project itself.”
http://bluebrain.epfl.ch/page-58110-en.html

http://bluebrain.epfl.ch/

10
Typical TDM Workflow

Source: Sophia Ananiadou/National Center for Text Mining, University of Manchester, UK
11
Supporting TDM at Elsevier
2006

…

• Began to support ad-hoc TDM access requests from customers
• Low but consistently increasing level of interest from early
adopters as computing power increases and tools get better

• First Content Mining policy published
• New APIs and Content Syndication Service rolled out to
2012
provide better technical solutions for TDM content access
• Pilot with ~30 academic customers to better understand needs
and define future policy
2013
• New Text and Data Mining policy for academic customers
announced
2014
12
TDM Pilot Learnings – Use Cases
Most academic TM requests fall in one or both of these categories:
1. Answering a specific
research question

2. Building a new data
resource for the community

• How long does it take for
concepts in STM literature to
reach general media?
• What is the relationship
between the research and
consulting commitments of
economics and finance
professors?
• What are the characteristics of
subjects in social psychology
experiments?

• An HIV mutation database for
which mutations found in
literature are mapped to the
underlying database sequence
• A database on growth and
alimentation of fishes, and
develop a fish classification to
identify new species for
aquaculture
• A database with the
electrophysiological properties
of diverse neuron types
13
TDM Pilot Learnings – Researcher Challenges
Technical

• Obtaining necessary infrastructure
• Having to deal with different formats from
content providers
• Sourcing and understanding TDM technology

Functional

• Fine-tuning pipeline, curating output,
representing output meaningfully

Logistical/Legal

• Gaining access to the needed content
• Gaining permission to mine the content

14
TDM Pilot Learnings – Library Challenges
Expertise

• Understanding specific TDM-based projects well
enough to assess implications & offer advice to
patrons

Legal

• Understanding and tracking what is allowed for
what resources
• Negotiating permissions with multiple providers
• Ensuring academic freedom is protected

Financial

• Concerns about any additional costs
• Understanding how TDM affects usage figures
for the library
15
TDM Pilot – Conclusions
1. Elsevier can offer enhanced value to
ScienceDirect customers by including basic text
and data mining access rights in subscription
agreements
2. Self-service access to content for TDM via our
APIs meets the needs of most researchers in
academia
3. There is demand for services beyond basic
content access that make text mining easier

16
TDM Access: What do Institutions
Need to do?
• TDM access clause will be part of standard
ScienceDirect subscription agreement for new
academic customers and upon renewal
• For existing agreements, an add-on contract
amendment is available – just contact your Elsevier
Account Manager
• After signing institutional agreement/amendment,
access to our API key registration page for your
researchers will be enabled for your institution’s IP
address range
17
TDM Access: What do Researchers
Need to do?
1. Register at http://developers.elsevier.com

18
TDM Access: What do Researchers
Need to do?
2. Accept a simple click-through agreement

19
TDM Access: What do Researchers
Need to do?
3. Obtain an Elsevier API Key

20
TDM Access: What do Researchers
Need to do?
4. Use API Key to retrieve full text of journal articles
and book chapters via the Elsevier API
• Elsevier XML and plain-text formats supported
• We are looking into supporting specialized textmining friendly XML formats

5. Process retrieved full-text through text mining
tools/workflow of choice
Documentation available on developer portal
21
Why an API? Can’t we just crawl
ScienceDirect.com?
• We are providing separate channels for machine-tomachine and human access to content
• This helps us to maintain site performance, availability
and reliability for everyone
• Other major web information sources have similar
policies
• Wikipedia:
http://en.wikipedia.org/wiki/Wikipedia:Database_download
• Twitter: https://twitter.com/tos
• PubMed Central:
http://www.ncbi.nlm.nih.gov/pmc/about/copyright/

• People who text-mine prefer APIs!!
22
Accessing an HTML page

23
Accessing an XML document

24
Supporting Cross-publisher TDM

Prospect is a new service from CrossRef providing two components to
address the issue of text and data mining scholarly literature across multiple
publishers:
• The “Prospect Common API” (PCAPI) can be used to access the full text of content
identified by CrossRef DOIs across publisher sites and regardless of their business
model.
• The “Prospect License Registry” (PLR) can (optionally) be used by researchers and
publishers as an efficient mechanism to provide “click-through” agreement of proprietary
TDM licenses.
• Both components are free to use by researchers and the public
https://prospect.crossref.org/
25
Prospect Workflow

26
Elsevier and Prospect
Elsevier is the first publisher to fully integrate with the
Prospect beta system:
• Researchers may read and agree to the Elsevier TDM clickthrough agreement via the Prospect License Registry
• Researchers may use a Prospect token to access Elsevier
content through the Prospect Common API rather than
using the Elsevier-specific API Key and Elsevier API
• Content is available in the same formats as the Elsevier API

27
Outlook: Beyond Basic Access
Text Mining as a Service
• Pilot with NaCTeM to
integrate their tools with
Elsevier content
• Hosted in the cloud
• Avoids the need for
researchers to build and
maintain TDM infrastructure
• Ability to define and execute
TDM workflows in a graphical
environment

28
Takeaway Points
• Researchers at academic institutions can now textmine subscribed Elsevier content for noncommercial purposes at no additional cost
• Contact your Elsevier Account Manager if you are
interested
• Elsevier is collaborating with customers and industry
partners to make text mining easier

• More info at: http://www.elsevier.com/tdm
29
Thankyou!

c.shillum@elsevier.com
@cshillum
This work is licensed under a Creative Commons Attribution 4.0 International License.

30

Contenu connexe

En vedette

Text Mining for Second Screen
Text Mining for Second ScreenText Mining for Second Screen
Text Mining for Second Screen
Ivan Demin
 
Der Nobelpreis geht an: Vitamin C
Der Nobelpreis geht an: Vitamin CDer Nobelpreis geht an: Vitamin C
Der Nobelpreis geht an: Vitamin C
Dr Rath
 
Merit Event - Understanding and Managing Data Protection
Merit Event - Understanding and Managing Data ProtectionMerit Event - Understanding and Managing Data Protection
Merit Event - Understanding and Managing Data Protection
meritnorthwest
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data mining
Slideshare
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniques
Saif Ullah
 

En vedette (18)

Text Mining for Second Screen
Text Mining for Second ScreenText Mining for Second Screen
Text Mining for Second Screen
 
Der Nobelpreis geht an: Vitamin C
Der Nobelpreis geht an: Vitamin CDer Nobelpreis geht an: Vitamin C
Der Nobelpreis geht an: Vitamin C
 
Semantische Systeme 3 0
Semantische Systeme 3 0Semantische Systeme 3 0
Semantische Systeme 3 0
 
Processing Big Data in Real-Time - Yanai Franchi, Tikal
Processing Big Data in Real-Time - Yanai Franchi, TikalProcessing Big Data in Real-Time - Yanai Franchi, Tikal
Processing Big Data in Real-Time - Yanai Franchi, Tikal
 
Legal issues Text and Data Mining
Legal issues Text and Data MiningLegal issues Text and Data Mining
Legal issues Text and Data Mining
 
Lecture1
Lecture1Lecture1
Lecture1
 
Personal Information Collection: A Trade-Off Analysis
Personal Information Collection: A Trade-Off AnalysisPersonal Information Collection: A Trade-Off Analysis
Personal Information Collection: A Trade-Off Analysis
 
Fundamentals of data security policy in i.t. management it-toolkits
Fundamentals of data security policy in i.t. management   it-toolkitsFundamentals of data security policy in i.t. management   it-toolkits
Fundamentals of data security policy in i.t. management it-toolkits
 
Big Data Mining - Classification, Techniques and Issues
Big Data Mining - Classification, Techniques and IssuesBig Data Mining - Classification, Techniques and Issues
Big Data Mining - Classification, Techniques and Issues
 
Merit Event - Understanding and Managing Data Protection
Merit Event - Understanding and Managing Data ProtectionMerit Event - Understanding and Managing Data Protection
Merit Event - Understanding and Managing Data Protection
 
Text and Data Mining
Text and Data MiningText and Data Mining
Text and Data Mining
 
A business driven approach to security policy management a technical perspec...
A business driven approach to security policy management  a technical perspec...A business driven approach to security policy management  a technical perspec...
A business driven approach to security policy management a technical perspec...
 
Visual data mining with HeatMiner
Visual data mining with HeatMinerVisual data mining with HeatMiner
Visual data mining with HeatMiner
 
1.3 applications, issues
1.3 applications, issues1.3 applications, issues
1.3 applications, issues
 
hands on: Text Mining With R
hands on: Text Mining With Rhands on: Text Mining With R
hands on: Text Mining With R
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data mining
 
Data security in local network using distributed firewall ppt
Data security in local network using distributed firewall ppt Data security in local network using distributed firewall ppt
Data security in local network using distributed firewall ppt
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniques
 

Similaire à Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining Policy

Lynch & Dirks - Platforms for Open Research - Charleston Conference 2011
Lynch & Dirks  - Platforms for Open Research - Charleston Conference 2011Lynch & Dirks  - Platforms for Open Research - Charleston Conference 2011
Lynch & Dirks - Platforms for Open Research - Charleston Conference 2011
Lee Dirks
 
Towards an Infrastructure for Mining Scientific Publications
Towards an Infrastructure for Mining Scientific PublicationsTowards an Infrastructure for Mining Scientific Publications
Towards an Infrastructure for Mining Scientific Publications
petrknoth
 
II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in Nice
Dr. Haxel Consult
 
How can repositories support the text mining of their content and why?
How can repositories support the text mining of their content and why?How can repositories support the text mining of their content and why?
How can repositories support the text mining of their content and why?
openminted_eu
 

Similaire à Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining Policy (20)

A Pragmatic Approach to Facilitating Text and Data Mining
A Pragmatic Approach to Facilitating Text and Data Mining A Pragmatic Approach to Facilitating Text and Data Mining
A Pragmatic Approach to Facilitating Text and Data Mining
 
Online Journal Management using Open Journal Systems (OJS)
Online Journal Management using Open Journal Systems (OJS)Online Journal Management using Open Journal Systems (OJS)
Online Journal Management using Open Journal Systems (OJS)
 
ufsojs-161024084446 (1).pdf
ufsojs-161024084446 (1).pdfufsojs-161024084446 (1).pdf
ufsojs-161024084446 (1).pdf
 
Webinar@AIMS on RIOXX
Webinar@AIMS on RIOXXWebinar@AIMS on RIOXX
Webinar@AIMS on RIOXX
 
Lynch & Dirks - Platforms for Open Research - Charleston Conference 2011
Lynch & Dirks  - Platforms for Open Research - Charleston Conference 2011Lynch & Dirks  - Platforms for Open Research - Charleston Conference 2011
Lynch & Dirks - Platforms for Open Research - Charleston Conference 2011
 
What Do Records Managers Need to Know About Open Source, Open Standards, Open...
What Do Records Managers Need to Know About Open Source, Open Standards, Open...What Do Records Managers Need to Know About Open Source, Open Standards, Open...
What Do Records Managers Need to Know About Open Source, Open Standards, Open...
 
OSFair2017 training | Machine accessibility of Open Access scientific publica...
OSFair2017 training | Machine accessibility of Open Access scientific publica...OSFair2017 training | Machine accessibility of Open Access scientific publica...
OSFair2017 training | Machine accessibility of Open Access scientific publica...
 
OAI-PMH
OAI-PMHOAI-PMH
OAI-PMH
 
Towards an Infrastructure for Mining Scientific Publications
Towards an Infrastructure for Mining Scientific PublicationsTowards an Infrastructure for Mining Scientific Publications
Towards an Infrastructure for Mining Scientific Publications
 
Implementing web scale discovery services: special reference to Indian Librar...
Implementing web scale discovery services: special reference to Indian Librar...Implementing web scale discovery services: special reference to Indian Librar...
Implementing web scale discovery services: special reference to Indian Librar...
 
OpenChain at EOLE 2017
OpenChain at EOLE 2017OpenChain at EOLE 2017
OpenChain at EOLE 2017
 
2013 DataCite Summer Meeting - Elsevier's program to support research data (H...
2013 DataCite Summer Meeting - Elsevier's program to support research data (H...2013 DataCite Summer Meeting - Elsevier's program to support research data (H...
2013 DataCite Summer Meeting - Elsevier's program to support research data (H...
 
II-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in NiceII-SDV 2015, 20 - 21 April, in Nice
II-SDV 2015, 20 - 21 April, in Nice
 
How practising open research can benefit you
How practising open research can benefit youHow practising open research can benefit you
How practising open research can benefit you
 
A social computing strategy for an enterprise content ecosystem
A social computing strategy for an enterprise content ecosystemA social computing strategy for an enterprise content ecosystem
A social computing strategy for an enterprise content ecosystem
 
How can repositories support the text mining of their content and why?
How can repositories support the text mining of their content and why?How can repositories support the text mining of their content and why?
How can repositories support the text mining of their content and why?
 
Metadata Ownership & Metadata Rights
Metadata Ownership & Metadata RightsMetadata Ownership & Metadata Rights
Metadata Ownership & Metadata Rights
 
Multilingual Data Value Chain for CEF Automated Translation: Interoperability...
Multilingual Data Value Chain for CEF Automated Translation:Interoperability...Multilingual Data Value Chain for CEF Automated Translation:Interoperability...
Multilingual Data Value Chain for CEF Automated Translation: Interoperability...
 
NISO April 30th RA21 Webinar
NISO April 30th RA21 WebinarNISO April 30th RA21 Webinar
NISO April 30th RA21 Webinar
 
Attribution in the Research Lifecycle: Persistent identifiers
Attribution in the Research Lifecycle: Persistent identifiersAttribution in the Research Lifecycle: Persistent identifiers
Attribution in the Research Lifecycle: Persistent identifiers
 

Dernier

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Dernier (20)

MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 

Presentation from ALA Midwinter 2014 on Elsevier's new Text and Data Mining Policy

  • 1. Facilitating Text and Data Mining ALA Midwinter 26 Jan 2014 Chris Shillum VP, Platform and Content Elsevier A&G Research Markets
  • 2. Outline • • • • • Our new policy What is Text and Data Mining (TDM)? Timeline of TDM at Elsevier Practicalities Outlook 2
  • 3. Elsevier’s New Text and Data Mining Policy Researchers at academic institutions can text mine subscribed content on ScienceDirect for noncommercial purposes via the ScienceDirect APIs Access is granted to faculty, researchers, staff and students at the subscribing institution Text mining output can be shared publically under these conditions 1. May contain "snippets" of up to 200 characters of the original text 2. Should be licensed as CC-BY-NC 3. Should include DOI link to original content http://www.elsevier.com/tdm Corporate and other subscribers • Your Elsevier Account Manager will be happy to discuss options to meet your needs Open access content • Text and Data mining permission are determined by the author's choice of user license. This information is detailed in the individual articles 3
  • 4. Policy Aligned with the Recent STM Declaration on TDM http://www.stm-assoc.org/2013_11_11_Text_and_Data_Mining_Declaration.pdf 4
  • 5. Text and Data Mining – What is it? Data mining: “the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems” Text mining: essentially a subset/“input step” to data mining: turning text into structured data for analysis, via application of natural language processing (NLP) and analytical methods. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities). http://en.wikipedia.org/wiki/Data_mining 5
  • 6. TDM Examples – Entity Tagging and Linking http://dx.doi.org/10.1016/j.bbalip.2012.02.009 6
  • 7. TDM Examples – Information Extraction http://genome.ucsc.edu/ 7
  • 8. TDM Examples – Information extraction http://neuroelectro.org 8
  • 9. TDM Examples – Natural Language Modelling http://www.elsevier.com/online-tools/pathway-studio/ 9 9
  • 10. Text mining is Becoming an Essential Tool Blue Blain project • Reconstructing the brain piece by piece and building a virtual brain in a supercomputer • Received €500M (~$685m) of EU funding “The success of the Blue Brain project depends on very high volumes of standardized, high quality experimental data covering all possible levels of brain organization. Data comes both from the literature (via the project’s automatic information extraction tools) and from experimental work conducted by the project itself.” http://bluebrain.epfl.ch/page-58110-en.html http://bluebrain.epfl.ch/ 10
  • 11. Typical TDM Workflow Source: Sophia Ananiadou/National Center for Text Mining, University of Manchester, UK 11
  • 12. Supporting TDM at Elsevier 2006 … • Began to support ad-hoc TDM access requests from customers • Low but consistently increasing level of interest from early adopters as computing power increases and tools get better • First Content Mining policy published • New APIs and Content Syndication Service rolled out to 2012 provide better technical solutions for TDM content access • Pilot with ~30 academic customers to better understand needs and define future policy 2013 • New Text and Data Mining policy for academic customers announced 2014 12
  • 13. TDM Pilot Learnings – Use Cases Most academic TM requests fall in one or both of these categories: 1. Answering a specific research question 2. Building a new data resource for the community • How long does it take for concepts in STM literature to reach general media? • What is the relationship between the research and consulting commitments of economics and finance professors? • What are the characteristics of subjects in social psychology experiments? • An HIV mutation database for which mutations found in literature are mapped to the underlying database sequence • A database on growth and alimentation of fishes, and develop a fish classification to identify new species for aquaculture • A database with the electrophysiological properties of diverse neuron types 13
  • 14. TDM Pilot Learnings – Researcher Challenges Technical • Obtaining necessary infrastructure • Having to deal with different formats from content providers • Sourcing and understanding TDM technology Functional • Fine-tuning pipeline, curating output, representing output meaningfully Logistical/Legal • Gaining access to the needed content • Gaining permission to mine the content 14
  • 15. TDM Pilot Learnings – Library Challenges Expertise • Understanding specific TDM-based projects well enough to assess implications & offer advice to patrons Legal • Understanding and tracking what is allowed for what resources • Negotiating permissions with multiple providers • Ensuring academic freedom is protected Financial • Concerns about any additional costs • Understanding how TDM affects usage figures for the library 15
  • 16. TDM Pilot – Conclusions 1. Elsevier can offer enhanced value to ScienceDirect customers by including basic text and data mining access rights in subscription agreements 2. Self-service access to content for TDM via our APIs meets the needs of most researchers in academia 3. There is demand for services beyond basic content access that make text mining easier 16
  • 17. TDM Access: What do Institutions Need to do? • TDM access clause will be part of standard ScienceDirect subscription agreement for new academic customers and upon renewal • For existing agreements, an add-on contract amendment is available – just contact your Elsevier Account Manager • After signing institutional agreement/amendment, access to our API key registration page for your researchers will be enabled for your institution’s IP address range 17
  • 18. TDM Access: What do Researchers Need to do? 1. Register at http://developers.elsevier.com 18
  • 19. TDM Access: What do Researchers Need to do? 2. Accept a simple click-through agreement 19
  • 20. TDM Access: What do Researchers Need to do? 3. Obtain an Elsevier API Key 20
  • 21. TDM Access: What do Researchers Need to do? 4. Use API Key to retrieve full text of journal articles and book chapters via the Elsevier API • Elsevier XML and plain-text formats supported • We are looking into supporting specialized textmining friendly XML formats 5. Process retrieved full-text through text mining tools/workflow of choice Documentation available on developer portal 21
  • 22. Why an API? Can’t we just crawl ScienceDirect.com? • We are providing separate channels for machine-tomachine and human access to content • This helps us to maintain site performance, availability and reliability for everyone • Other major web information sources have similar policies • Wikipedia: http://en.wikipedia.org/wiki/Wikipedia:Database_download • Twitter: https://twitter.com/tos • PubMed Central: http://www.ncbi.nlm.nih.gov/pmc/about/copyright/ • People who text-mine prefer APIs!! 22
  • 23. Accessing an HTML page 23
  • 24. Accessing an XML document 24
  • 25. Supporting Cross-publisher TDM Prospect is a new service from CrossRef providing two components to address the issue of text and data mining scholarly literature across multiple publishers: • The “Prospect Common API” (PCAPI) can be used to access the full text of content identified by CrossRef DOIs across publisher sites and regardless of their business model. • The “Prospect License Registry” (PLR) can (optionally) be used by researchers and publishers as an efficient mechanism to provide “click-through” agreement of proprietary TDM licenses. • Both components are free to use by researchers and the public https://prospect.crossref.org/ 25
  • 27. Elsevier and Prospect Elsevier is the first publisher to fully integrate with the Prospect beta system: • Researchers may read and agree to the Elsevier TDM clickthrough agreement via the Prospect License Registry • Researchers may use a Prospect token to access Elsevier content through the Prospect Common API rather than using the Elsevier-specific API Key and Elsevier API • Content is available in the same formats as the Elsevier API 27
  • 28. Outlook: Beyond Basic Access Text Mining as a Service • Pilot with NaCTeM to integrate their tools with Elsevier content • Hosted in the cloud • Avoids the need for researchers to build and maintain TDM infrastructure • Ability to define and execute TDM workflows in a graphical environment 28
  • 29. Takeaway Points • Researchers at academic institutions can now textmine subscribed Elsevier content for noncommercial purposes at no additional cost • Contact your Elsevier Account Manager if you are interested • Elsevier is collaborating with customers and industry partners to make text mining easier • More info at: http://www.elsevier.com/tdm 29
  • 30. Thankyou! c.shillum@elsevier.com @cshillum This work is licensed under a Creative Commons Attribution 4.0 International License. 30

Notes de l'éditeur

  1. The requirement for licensing TDM output as CC-BY-NC only really applies to TDM Output insofar as it includes copyrightable material derived from the Elsevier content. For instance, we don’t require that any papers written about the TDM findings are published under CC-BY-NC.
  2. Elsevier is one of 16 publishers signing the declaration. Note that the Elsevier policy is _not_ limited only to the EU.
  3. TDM is a very large domain that encompasses many concepts. What follows are three examples of text mining-based operations on scholarly literature.
  4. Content tagging is not new – we (and librarians) have done that for years – manually. Thesauri are used to classify and index publications. Text mining technologies, however, allow us to that in automated ways, covering vast corpuses very quickly and increasingly accurately.
  5. Text mining is no longer purely the subject of research itself, nor just an experimental tool used by tech-savvy researchers. It is a core component of an increasing number of research projects and for many of those projects, text mining is critical for their success. An example of a project that uses text- and data mining extensively is the Blue Brain project, one of the flagship research projects in the European union and recipient of a 500 million euro (>650M US dollars) grant.
  6. Here is a typical text mining workflow, courtesy of NaCTeM at University of Manchester:A corpus of source documents is collected and preparedA part-of-speech tagger runs over the documents, breaking down the sentences by grammatical syntaxTerms are normalized: plural vs. singular, alternative spellings, etc.Acronyms are identifiedA value is calculated that expresses the accuracy of a specific term’s automatic recognitionExtracted terms are consolidated into a databaseThe source documents are annotated with the extracted, normalized terms, and the terms are analyzed
  7. Elsevier has supported requests from researchers for text mining access on an ad-hoc basis for several years. Demand has gradually increased as computing power increased and more tools became available. In 2012, we made investments to improve our technological support for these needs, and in 2013, ran a formal pilot to better understand researcher needs and help us design an appropriate technical and legal policy to meet those needs.
  8. One of the key learnings from the pilot was that academic needs for TDM typically fall into one of both of two categories.
  9. As solution for TDM is designed to address several of the key challenges researchers currently have: specifically, to make it easier to gain permission to mine and to access content in a simple and effective way suitable for text mining workflows.
  10. Librarians also face challenges in supporting TDM needs of their researchers. Our new policy is designed to allow librarians to fulfill those needs quickly and simply with no additional cost.
  11. Our experiences have led us to two conclusions: we need to a. support basic access for do-it-yourself text mining, and b. there is an opportunity for making text mining easier
  12. Probably good to mention that you’ll talk about Prospect later.
  13. Probably good to mention that you’ll talk about Prospect later.
  14. Probably good to mention that you’ll talk about Prospect later.
  15. Probably good to mention that you’ll talk about Prospect later.
  16. We also recognize that researchers typically want to mine content across multiple publishers, so we are leading participants in a new industry initiative designed to make that easier.