Semantic Journal Mapping for Search Visualization
in a Large Scale Article Digital Library

Glen Newton (1,2), Alison Callahan (1), Michel Dumontier (2)
(1) National Research Council Canada, (2) Carleton University
Second Workshop on Very Large Digital Libraries (VLDL) 2009
at ECDL 2009
Oct 2 2009, Corfu, Greece
Outline

•   Maps of Science
•   Background
•   Research Interests
•   Research Goals
•   Process
•   Scalability issues
•   Environment
•   Results
•   Conclusions
•   Future Work
Maps of Science

• [Figure: map of science, from Bollen et al. 2009, PLoS ONE]
• [Figures: maps of science, from Leydesdorff & Rafols 2006]
Background

• Canada Institute for Scientific and Technical Information (CISTI):
    Canada's national science library
• ~3000 active researchers at NRC
• Large collection of ~8.4m articles (full text + metadata) in
    science, technology, medicine (STM)
• 4100 journal titles
• ~1995 to 2009
Research Interests

• Domain-specific discovery
• Improved discovery in STM domains through results visualization
    and contextualization, browse/explore/refine
• Results set visualization: “mapping”
Research Goals

• Find a way to extract a journal (& article) semantic vector space
• Latent Semantic Analysis (LSA) works for small/medium-sized
     corpora, but does not scale to large numbers of items and/or terms
• New alternative: Semantic Vectors (SV): uses random vectors &
     avoids expensive singular value decomposition (SVD)
• Can SV scale & generate a sensible semantic vector space of
     journals on a corpus of this size?
• Can the visualization produced be useful for query results
     visualization, refinement, discovery?
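
The random-vector idea can be illustrated with a minimal random-indexing sketch. This is a simplified illustration, not the Semantic Vectors package's code: the function names, the seeding by term, and the choice of 10 non-zero entries per vector are all assumptions made for the example; only the 512 dimensions come from the slides.

```python
import math
import random

DIM = 512  # number of semantic dimensions quoted in the slides


def elemental_vector(term, dim=DIM, nonzero=10):
    """Sparse random +1/-1 vector for a term (hypothetical helper).

    Seeding the generator with the term itself keeps the vector reproducible.
    """
    rng = random.Random(term)
    vec = [0.0] * dim
    for k, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1.0 if k % 2 == 0 else -1.0
    return vec


def document_vector(tokens, dim=DIM):
    """Sum the elemental vectors of a document's tokens; no SVD is involved."""
    doc = [0.0] * dim
    for token in tokens:
        for i, value in enumerate(elemental_vector(token, dim)):
            doc[i] += value
    return doc


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


biology_a = document_vector("protein gene cell".split())
biology_b = document_vector("protein gene enzyme".split())
astronomy = document_vector("galaxy star orbit".split())
print(cosine(biology_a, biology_b), cosine(biology_a, astronomy))
```

Documents that share terms end up close in the reduced space because they share elemental vectors, while unrelated documents are nearly orthogonal; the cost grows linearly with the number of tokens rather than with a matrix decomposition.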
Corpus

• Licensed journal articles from STM publishers: Elsevier, Springer,
     etc
• ~4100 journal titles, classified into 23 categories (by librarians)
• ~8.4m journal articles
• Selection of articles/journals:
       – Only those with authors, abstract (no notices, obituaries, etc)
       – Only English language articles
       – Only journals with >50 articles in corpus
       – Resulting corpus: 5,733,721 articles from 2231 journals
       – Categories overlap: 1.53 categories per journal on average
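
The selection rules above could be expressed as a small filter. This is only a sketch: the record field names (`journal`, `authors`, `abstract`, `language`) are hypothetical, and it assumes the >50-article threshold is applied after the article-level filters, which the slides do not state.

```python
from collections import Counter


def select_corpus(articles, min_articles=50):
    """Apply the slide's selection rules to a list of article dicts.

    Field names are hypothetical; the slides do not describe the record format.
    """
    # Keep English articles that have both authors and an abstract
    # (this drops notices, obituaries and similar items).
    kept = [
        a for a in articles
        if a.get("authors") and a.get("abstract") and a.get("language") == "en"
    ]
    # Keep only journals with more than min_articles surviving articles
    # (assumption: the threshold is checked after the filters above).
    per_journal = Counter(a["journal"] for a in kept)
    return [a for a in kept if per_journal[a["journal"]] > min_articles]
```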
Category                                       # Journals per category
Agriculture & Biological Sciences              358
Arts and Humanities                            70
Biochemistry, Genetics and Molecular Biology   240
Business, Management and Accounting            106
Chemical Engineering                           126
Chemistry                                      226
Civil Engineering                              64
Computer Science                               218
Decision Science                               50
Earth and Planetary Science                    146
Economics, Econometrics and Finance            112
Energy and Power                               73
Engineering and Technology                     328
Environmental Science                          138
Immunology and Microbiology                    104
Materials Science                              160
Mathematics                                    205
Medicine                                       671
Neuroscience                                   103
Pharmacology, Toxicology and Pharmaceutics     73
Physics and Astronomy                          210
Psychology                                     126
Social Science                                 222
Process

• Index full text (only) with Lucene 2.4 using the LuSql tool, with an
     aggressive stopword list and Porter stemming
• Build Semantic Vectors (v1.18, parallelized) index from Lucene
     index, with 512 semantic dimensions
• Compute item x item distance matrix from SV index of
     512-dimensional vectors
• Using R, apply multidimensional scaling (MDS) to reduce from
     512-D to 2-D
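
The last two steps can be sketched in a few lines. The slides do not say which distance measure or which R function was used, so this sketch assumes cosine distance and classical (metric) MDS, and it finds the two leading eigenvectors by plain power iteration, which is adequate only for small matrices whose dominant eigenvalues are positive. All function names are illustrative.

```python
import math


def cosine_distance_matrix(vectors):
    """Pairwise cosine distances (1 - cosine similarity); vectors must be non-zero."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            dist[i][j] = dist[j][i] = 1.0 - dot / (norms[i] * norms[j])
    return dist


def _top_eigenpair(matrix, iterations=500):
    """Dominant eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    n = len(matrix)
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(m * x for m, x in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:
            break
        vec = [x / norm for x in nxt]
    image = [sum(m * x for m, x in zip(row, vec)) for row in matrix]
    return sum(a * b for a, b in zip(vec, image)), vec


def classical_mds_2d(dist):
    """Embed items in 2-D from a symmetric distance matrix (classical MDS)."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    axes = []
    for _ in range(2):
        value, vec = _top_eigenpair(b)
        scale = math.sqrt(max(value, 0.0))
        axes.append([scale * x for x in vec])
        # Deflate so the next pass finds the next eigenvector.
        b = [[b[i][j] - value * vec[i] * vec[j] for j in range(n)] for i in range(n)]
    return list(zip(axes[0], axes[1]))
```

A call such as `classical_mds_2d(cosine_distance_matrix(journal_vectors))`, where `journal_vectors` is a hypothetical list of 512-dimensional vectors, returns one (x, y) pair per journal.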
Scalability Issues

•  #items, #unique terms
        – #unique terms: SV handles very well
        – #items: SV handles fairly well
        – #items: determines size of distance matrix (#items x #items)
        – R cannot handle a huge article distance matrix in MDS (i.e.
             millions of articles vs. thousands of journals)
• Instead of using articles as items, use journals as items
• Make a single large full-text document per journal by concatenating
      all of that journal's articles, & index these
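
A back-of-the-envelope calculation shows why the item count matters. The sketch below assumes a dense matrix of 8-byte floating-point entries (an assumption; the slides do not give the storage format) and uses the corpus sizes quoted earlier. The `journal_documents` helper is a hypothetical illustration of the concatenation step.

```python
from collections import defaultdict

ARTICLES = 5_733_721  # articles in the selected corpus
JOURNALS = 2_231      # journals in the selected corpus


def matrix_entries(n_items):
    """Number of cells in a dense n_items x n_items distance matrix."""
    return n_items * n_items


def matrix_bytes(n_items, bytes_per_entry=8):
    """Storage needed, assuming 8-byte entries and no use of symmetry."""
    return matrix_entries(n_items) * bytes_per_entry


def journal_documents(articles):
    """Concatenate (journal, full_text) pairs into one document per journal."""
    parts = defaultdict(list)
    for journal, text in articles:
        parts[journal].append(text)
    return {journal: "\n".join(texts) for journal, texts in parts.items()}


print(f"article matrix: {matrix_bytes(ARTICLES) / 2**40:.0f} TiB")
print(f"journal matrix: {matrix_bytes(JOURNALS) / 2**20:.0f} MiB")
```

Under these assumptions the article-level matrix needs hundreds of terabytes, while the journal-level matrix fits in tens of megabytes.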
Environment

• Dell PowerEdge 1955 Blade server, 2 x dual-core Xeon 5050
    processors with 2x2MB cache, 3.0 GHz 64-bit, 32GB RAM,
    attached to a Dell EMC AX150 storage array via a SilkWorm
    200E Series 16-Port Capable 4Gb Fabric Switch
• Operating system: Linux openSUSE 10.2 (64-bit x86-64), kernel
    2.6.18.8-0.10-default #1 SMP
• Java version 1.6.0_07 (build 1.6.0_07-b06), Java HotSpot 64-Bit
  Server VM (build 10.0-b23, mixed mode)
• Processing 1.0 (processing.org)
Results: Scalability

• Corpus: ~600GB full-text
• Lucene index: 43GB
      – LuSql: 13 hours 51 minutes to produce
• SV index: 58 minutes, 885 MB, 21.6m terms
      – Distance matrix: 6 minutes
Results: Visualization

• Using Processing environment, built simple
    validation/visualization tool
Harder sciences and engineering categories

• [Map panels: Chemistry; Materials Science; Physics and Astronomy;
    Engineering and Technology; Mathematics; Computer Science;
    Civil Engineering; Chemical Engineering]
Agriculture and biomedical categories

• [Map panels: Agriculture and Biological Sciences; Biochemistry,
    Genetics and Molecular Biology; Immunology and Microbiology;
    Pharmacology; Neuroscience; Medicine; Psychology]
Interdisciplinary and non-science categories

• [Map panels: Environmental Science; Earth and Planetary Science;
    Energy and Power; Decision Science; Economics, Econometrics and
    Finance; Social Sciences; Business, Management and Accounting;
    Arts and Humanities]
Examination of outliers, extrema and cataloging errors

• [Map panel: Environmental Science] Ecotoxicology and Environmental
    Safety; Organic Geochemistry; Corporate Environmental Strategy
• [Map panel: Medicine] Journal of Biomolecular NMR; Journal of X-Ray
    Science and Technology
• [Map panel: Medicine] Colloidal and Polymer Science; Annales Henri
    Poincare
• [Map panel: Medicine] French language Medical & Psychology Journals
• [Map panel: Mathematics] Bulletin of Mathematical Biology; Journal of
    Medical Ultrasonics
Conclusions

• Reasonable mapping results
• Full-text only (no citations, metadata) gives good results
• Scalable to significant size
Future Work

• Proper precision and recall evaluation using same corpus
• Validate with NetNews-20 collection for P & R
• Evaluate non-metric MDS
• Project articles onto semantic journal space & build interactive
    discovery interface & evaluate
       – Index journal 'documents' and journal articles
       – SV on all
       – Distance matrix only on journals
       – Do MDS
       – Use eigenvectors to transform N-d article vector to 2-D
• Explore 3-D interface (MDS N-d → 3D)
Acknowledgements

• Greg Kresko, Andre Vellino, Jeff Demaine @ NRC-CISTI
Demo

• Link to project demo page
License




Creative Commons Attribution-Noncommercial-No Derivative Works 2.5