R. Koopman, S. Wang, A. Scharnhorst (2015) Between information retrieval services and bibliometrics research. New ways of semantic browsing and visual analytics. Presentation at the Sigmetrics workshop, ASIST 2015, November 7, 2015 St. Louis, Missouri
1. Between information retrieval services and bibliometrics research – new ways of semantic browsing and visual analytics
Rob Koopman, Shenghui Wang
OCLC Research
Andrea Scharnhorst
DANS-KNAW
November 7, 2015
ASIST, sigmetrics workshop
2. Content
- New approach to find structure in bibliographic information – ARIADNE (2 Method)
- Applications:
  - Data curation – author disambiguation (1 Motivation)
  - Illustration of topics – the case of digital humanities
  - Topical browsing – DEMO (3)
- Excursion into bibliometrics – the Berlin group challenge (4)
- Wrapping up (5)
4. Mapping topics, communities, research fronts, …
Bibliometrics
Documents are similar because they:
- Cite each other
- Are cited together
- Use the same references
- Use the same vocabulary
- Have the same authors
Information retrieval
Documents are similar because they:
- Use the same vocabulary
- …
ARIADNE is about similarity of entities!
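5. Document/work, Record and Entity
[Diagram: a bibliographic record with the fields Authors, Title, Journal, …, Reference, Subject is broken down into entities: author names, topical terms, references, journals. Example author names: "Glänzel, W." and "Glanzel, W."; example topical terms: bibliometrics, citations, …, Casimir effect. N=SUM (doc)]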
8. Dataset
● WorldCat, 300+ million records
● Selected 13 million items (topical terms,
authors, ISSNs, Dewey decimal codes,
publishers, subject headings)
● Represented by 6 million topical terms
But a matrix of 13M x 6M is too big to process
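A rough estimate shows why the full matrix is impractical. The sketch below assumes dense storage with 4 bytes per cell and a hypothetical reduced dimension `k = 256`; neither figure is given in the slides.

```python
# Back-of-the-envelope memory estimate for the entity-by-term matrix.
n_items = 13_000_000    # selected items (from the slide)
n_terms = 6_000_000     # topical terms (from the slide)
bytes_per_cell = 4      # assumption: 32-bit values, dense storage
k = 256                 # hypothetical reduced dimension, not stated in the slides

dense_bytes = n_items * n_terms * bytes_per_cell
reduced_bytes = n_items * k * bytes_per_cell

print(f"full matrix:    {dense_bytes / 1e12:.0f} TB")
print(f"reduced matrix: {reduced_bytes / 1e9:.1f} GB")
```

Under these assumptions the full matrix would need about 312 TB, while a 256-column approximation would need about 13.3 GB.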
9. Step 1: Building the semantic matrix – and dimension reduction based on Random Projection
C: a co-occurrence matrix
R: a random matrix of +/-1
C’: approximation of C after random projection -- semantic matrix
Koopman, R., Wang, S., Scharnhorst, A., Englebienne, G.: Ariadne’s thread: Interactive navigation in a world of networked information. In: CHI’15 Extended Abstracts.
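The projection C’ = C·R can be illustrated with a minimal NumPy sketch. The matrix sizes, the reduced dimension `k`, the scaling by 1/sqrt(k) and the toy Poisson data are all assumptions made for illustration; the slides only state that R is a random matrix of +/-1.

```python
import numpy as np

def random_projection(C, k, seed=0):
    """Approximate C (entities x terms) by C' = C @ R, with R drawn from {-1, +1}."""
    rng = np.random.default_rng(seed)
    R = rng.choice(np.array([-1.0, 1.0]), size=(C.shape[1], k))
    # Scaling by 1/sqrt(k) keeps vector lengths comparable (an assumption here).
    return (C @ R) / np.sqrt(k)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy co-occurrence counts: 200 entities x 5000 topical terms.
rng = np.random.default_rng(42)
C = rng.poisson(0.05, size=(200, 5000)).astype(float)
# Entity 1 gets almost the same term profile as entity 0.
C[1] = C[0] + rng.poisson(0.01, size=5000)

C_reduced = random_projection(C, k=256)

print("original cosine(0, 1):", round(cosine(C[0], C[1]), 3))
print("reduced  cosine(0, 1):", round(cosine(C_reduced[0], C_reduced[1]), 3))
```

The point of the sketch is that similar rows of C stay similar in C’, even though C’ has far fewer columns.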
10. Step 2: Interactive exploration
- Provide a simple search/text box
- Calculate the top 500 most related candidates
- Find mutually related items
- Convert distances to probabilities
- Project to 2D
- Enhance interface with links to other spaces
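The computational steps in this list could be sketched as follows. This is not the Ariadne implementation: the slides do not name the methods used, so mutual k-nearest neighbours, a Gaussian kernel on cosine distance, and an SVD-based 2D projection are stand-ins chosen for illustration, and all function names and parameters are hypothetical.

```python
import numpy as np

def top_related(V, query_idx, n=500):
    """Indices of the n entities most cosine-similar to the query (query included)."""
    U = V / np.linalg.norm(V, axis=1, keepdims=True)
    sims = U @ U[query_idx]
    return np.argsort(-sims)[:n]

def mutual_neighbours(U, k=5):
    """Boolean matrix: True where i is among j's k nearest neighbours and vice versa."""
    S = U @ U.T
    np.fill_diagonal(S, -np.inf)            # an item is not its own neighbour
    nearest = np.argsort(-S, axis=1)[:, :k]
    A = np.zeros(S.shape, dtype=bool)
    A[np.arange(len(S))[:, None], nearest] = True
    return A & A.T

def distances_to_probabilities(U, sigma=0.5):
    """Row-normalised Gaussian kernel on cosine distance (an assumed choice)."""
    D = 1.0 - U @ U.T
    P = np.exp(-(D ** 2) / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum(axis=1, keepdims=True)

def project_2d(U):
    """Project onto the first two principal directions (an assumed choice)."""
    X = U - U.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T

# Toy semantic vectors: 50 entities in 16 dimensions.
rng = np.random.default_rng(0)
V = rng.normal(size=(50, 16))

idx = top_related(V, query_idx=0, n=20)     # candidates around the query
U = V[idx] / np.linalg.norm(V[idx], axis=1, keepdims=True)
M = mutual_neighbours(U, k=5)               # mutually related items
P = distances_to_probabilities(U)           # distances as probabilities
XY = project_2d(U)                          # 2D coordinates for display
```

In an interactive tool the query index would come from the search box, and `XY` would drive the layout of the displayed items.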
11. Exploration of a topic
http://thoth.pica.nl/relate?input=hirsch%20index&fsize=100&ncluster=
14. Illustration of context around a topic/field – journal view
- Digital libraries
- Science, Computer Science, ontologies
- Many different humanities fields
- Prominently language & literary studies
Koopman, R., Wang, S., Scharnhorst, A., Englebienne, G.: Ariadne's thread: Interactive navigation in a world of networked information. In: CHI'15 Extended Abstracts. (2015)
16. Wrapping up – future work
● Compare the algorithm to other existing algorithms – benchmarking
● More metadata fields (publisher, subject, identifiers) – ongoing
● Identify further problems to which Ariadne can be applied
  - Curation (e.g. author name disambiguation);
  - Knowledge discovery (e.g. matching chemical molecules);
  - Information science – population of libraries, subject areas, …
● Feedback from users – Prepare user scenarios for usability testing and set up an evaluation project – tbd
● Improve visualisation
● More functionality (timeline, history)
● Extend the implementation to other databases
18. References
Koopman, R., Wang, S., Scharnhorst, A., Englebienne, G.: Ariadne's thread: Interactive
navigation in a world of networked information. In: B. Begole, J. Kim, K. Inkpen, W. Woo
(eds.) Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human
Factors in Computing Systems, CHI 2015 Extended Abstracts, Seoul, Republic of Korea, April 18 - 23,
2015, pp. 1833–1838. ACM (2015). DOI 10.1145/2702613.2732781. URL
http://doi.acm.org/10.1145/2702613.2732781 (Preprint Arxiv.org)
Koopman, R., Wang, S., Scharnhorst, A.: Contextualization of Topics - Browsing through
Terms, Authors, Journals and Cluster Allocations. In: A.A. Salah, Y. Tonta, A.A.A.
Salah, C. Sugimoto, U. Al (eds.) Proceedings of ISSI 2015 Istanbul. 15th International
Society of Scientometrics and Informetrics Conference, Istanbul, Turkey, 29th June to 4th
July 2015, pp. 1042–1053. Boğaziçi University Printhouse, Istanbul (2015). URL
http://www.issi2015.org/en/Proceedings-of-ISSI-2015.html
Editor's notes
[Snapshot around an author such as Loet Leydesdorff or Wolfgang Glaenzel]
The idea at the beginning was: would one and the same author not have a similar ‘semantic’ fingerprint across his or her scholarly communication, whether we look at articles in an article database or at book production in WorldCat? The latter is of course a more complex problem, because the signal is weaker: authors usually publish fewer books than articles.
This network shows nodes representing author names, other words, journals, …; the links between them represent similarity in terms of their lexical profile.
Let me explain this in more detail.
At the end we get document-document matrices: symmetric matrices derived from asymmetric matrices such as documents-references, documents-authors and documents-words.
In all those cases the unit of analysis is the document as represented by the bibliographic record, and the counterparts are elements of this record or additional information in the document, such as the references.
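The step from an asymmetric matrix to a symmetric document-document matrix can be illustrated with a toy documents-references matrix. Multiplying it by its transpose counts shared references per document pair, which is one common way to do this; the notes do not say which derivation is meant, and the data below is invented for illustration.

```python
import numpy as np

# Toy documents-references matrix A: A[d, r] = 1 if document d cites reference r.
A = np.array([
    [1, 1, 0, 0],   # document 0
    [1, 1, 1, 0],   # document 1
    [0, 0, 0, 1],   # document 2
])

# A @ A.T is symmetric; entry (i, j) is the number of references
# that documents i and j have in common.
coupling = A @ A.T
print(coupling)
```

The same product applied to a documents-authors or documents-words matrix yields shared-author or shared-vocabulary counts.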
In different bibliographic systems we find descriptions of works (articles, journals, objects, …) in the form of a (classical) bibliographic record, often with additional information.
In a first step we deconstruct the bibliographic record, including any additional information, and extract categories of entities such as author names, journal names, subject headings, Dewey and other classifications.
In a second step we ask how often these entities appear together with topical terms. In other words, we now construct a word space not for the documents, but for the entities extracted from the document records. Documents are still relevant: to calculate the co-occurrence of an entity and a topical term we go through all documents and count how often, for example, a certain author name and a topical word appear together. We call the resulting vector the semantic representation of an entity. Returning to our motivation: if an author is the same but spelled differently, we would assume that, in a large corpus of documents, the semantic representations of the two spellings would be very similar. What we construct is a co-occurrence matrix between entities and topical terms. From this matrix we can derive a similarity matrix between entities, taking the cosine of the vectors as the measure. This similarity matrix can be visualized as a network in which entities are nodes, and in any visual representation of this similarity the two ‘authors’ would be near each other.
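A minimal sketch of this second step, using toy records. The record layout, the third author name and the term lists are invented for illustration; only the two spellings of Glänzel and the cosine measure come from the talk.

```python
from collections import Counter, defaultdict
import math

# Toy bibliographic records (hypothetical data).
records = [
    {"authors": ["Glänzel, W."], "terms": ["bibliometrics", "citations"]},
    {"authors": ["Glanzel, W."], "terms": ["bibliometrics", "citations", "h-index"]},
    {"authors": ["Glänzel, W."], "terms": ["h-index", "bibliometrics"]},
    {"authors": ["Author, A."], "terms": ["casimir effect", "quantum field"]},
]

def entity_term_counts(records):
    """Count how often each author name co-occurs with each topical term."""
    counts = defaultdict(Counter)
    for rec in records:
        for author in rec["authors"]:
            counts[author].update(rec["terms"])
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vectors = entity_term_counts(records)
same_person = cosine(vectors["Glänzel, W."], vectors["Glanzel, W."])
unrelated = cosine(vectors["Glänzel, W."], vectors["Author, A."])
print(round(same_person, 3), round(unrelated, 3))
```

In this toy data the two spellings share most of their terms and score close to 1, while the unrelated author shares none and scores 0.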
We can of course do this kind of analysis only because we have standardized and digitized information. Ariadne has been developed around MARC records in different information services that OCLC provides: ArticleFirst, where the demonstrator runs now, and WorldCat, in which we did an exploration. But in principle it can be applied to any database or dataset. You have seen an example in Theresa’s presentation.
If you have never heard of the Hirsch index: where does it belong? What other terms are around it?
What are the different aspects of this topic?
Are there related aspects missing in my search terms?
Who are the most prominent authors on this topic?
Which journals publish most about this topic?
How have others, e.g. librarians, described and classified this topic?
In the case of the h-index there is a Wikipedia entry which is much more detailed; still, ARIADNE gives you a first orientation.
An Ariadne search in ArticleFirst, a database from OCLC, indicates which journals are involved and, based on these, which fields are involved; the result is not so surprising.