Towards research data knowledge graphs

Talk at AIFB/Karlsruhe on 21.2.2020 on exploiting artificial and crowd intelligence towards constructing research data knowledge graphs.

Published in: Technology

Towards research data knowledge graphs

  1. Beyond research data infrastructures - exploiting artificial & crowd intelligence for building research knowledge graphs
     Stefan Dietze, GESIS – Leibniz Institute for the Social Sciences & Heinrich-Heine-Universität Düsseldorf
     AIFB / KIT, Karlsruhe, 21.02.2020
  2. Beyond research data infrastructures - exploiting artificial & crowd intelligence for building research knowledge graphs
     Stefan Dietze, GESIS – Leibniz Institute for the Social Sciences & Heinrich-Heine-Universität Düsseldorf
     LWDA2019, 02 October 2019
     (Buzzword) Bingo!? research data infrastructure, data fusion, distant supervision, Web mining, distributional semantics, knowledge graph, neural entity linking, research data, machine learning, social web, artificial intelligence, semantics, claim extraction, stance detection, fact verification, crowd
  3. Finding research data on the Web? (17/03/20, Stefan Dietze)
  4. Finding research data on the Web?
  5. Finding research data on the Web?
  6. Finding (social sciences) research data on the Web
  7. Overview
     • Part I: Retrieving, extracting and linking research data (in particular: metadata) on the Web
     • Part II: Mining novel forms of research data (KGs) from the Web
     [Figure labels: Datasets, Metadata, Publications, Web pages, Opinions, Claims, Stances]
  8. Web mining of dataset metadata (or: dataset KGs)
     • Harvesting from open data portals (e.g. DCAT/VoID metadata on DataHub.io, DataCite etc.)
     • Information extraction on the long tail of Web documents? => dynamics & scale: approx. 50 trn (50.000.000.000.000) Web pages indexed by Google (plus a vast number of temporal snapshots)
     • Embedded markup (RDFa, Microdata, Microformats) for annotation of Web pages
       o Supports Web search & interpretation
       o Pushed by Google, Yahoo, Bing et al. (schema.org vocabulary)
       o Adopted by 38% of all Web pages (sample: Common Crawl 2016, 3.2 bn Web pages)
     • Easily accessible, large-scale source of factual knowledge (about research data & research information)
     • Large-scale source of training data, e.g. manually annotated Web pages citing datasets
     Example markup:
       <div itemscope itemtype="http://schema.org/Dataset">
         <h1 itemprop="name">World Bank-Commodity Prices</h1>
         <span itemprop="distribution">URL-X</span>
         <span itemprop="license">CC-BY</span>
         ...
       </div>
     Extracted facts (“quads”: node, property, value, source):
       node1  name            WB Commodity   URI-1
       node1  distribution    node_xy        URI-1
       node1  creator         Worldbank      URI-1
       node1  dateCreated     26 April 2017  URI-1
       node2  creator         World Bank     URI-2
       node2  encodingFormat  text/CSV       URI-2
       node3  dateCreated     26 April 2007  URI-3
       node3  keywords        crude          URI-3
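The step from embedded markup to quads on this slide can be illustrated with a minimal sketch. It uses only Python's standard `html.parser` and handles flat, non-nested Microdata; the class name `MicrodataQuads` and the `node1` naming scheme are assumptions made for illustration, not the tooling used in the talk.

```python
from html.parser import HTMLParser


class MicrodataQuads(HTMLParser):
    """Collect (node, property, value, page_url) quads from flat Microdata."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.quads = []
        self.node_count = 0
        self.current_node = None   # label of the item currently being read
        self.pending_prop = None   # itemprop waiting for its text value

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            # A new item starts: give it a fresh node label.
            self.node_count += 1
            self.current_node = f"node{self.node_count}"
            if "itemtype" in attrs:
                self.quads.append(
                    (self.current_node, "type", attrs["itemtype"], self.page_url)
                )
        elif "itemprop" in attrs and self.current_node:
            self.pending_prop = attrs["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self.pending_prop and text:
            self.quads.append(
                (self.current_node, self.pending_prop, text, self.page_url)
            )
            self.pending_prop = None


html = """
<div itemscope itemtype="http://schema.org/Dataset">
  <h1 itemprop="name">World Bank-Commodity Prices</h1>
  <span itemprop="distribution">URL-X</span>
  <span itemprop="license">CC-BY</span>
</div>
"""

parser = MicrodataQuads("URI-1")
parser.feed(html)
quads = parser.quads
for quad in quads:
    print(quad)
```

A production extractor would also need to handle nested items, RDFa and Microformats, which this sketch deliberately leaves out.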
  9. Research dataset markup on the Web
     • In Common Crawl 2017 (3.2 bn pages):
       o 14.1 M statements & 3.4 M instances related to “s:Dataset”
       o Spread across 200 K pages from 2878 pay-level domains (PLDs); the top 10% of PLDs provide 95% of the data
     • Studies of scholarly articles and other types [SAVESD16, WWW2017]: the majority of major publishers, data hosting sites, data registries, libraries and research organisations are represented; dataset metadata follows a power law distribution across PLDs
     • Challenges
       o Errors: factual errors, annotation errors (see also [Meusel et al., ESWC2015])
       o Ambiguity & coreferences: e.g. 18.000 entity descriptions of “iPhone 6” in Common Crawl 2016 & ambiguous literals (e.g. “Apple”)
       o Redundancies & conflicts: vast amounts of equivalent or conflicting statements
  10. KnowMore: data fusion on markup
      Pipeline
      • 0. Noise: data cleansing (node URIs, deduplication etc.)
      • 1.a) Scale: blocking through BM25 entity retrieval on the markup index
      • 1.b) Relevance: supervised coreference resolution
      • 2.) Quality & redundancy: data fusion through supervised fact classification (SVM, kNN, RF, LR, NB) with a diverse feature set (authority, relevance etc.), considering source-level (e.g. PageRank), entity-level & fact-level features
      Worked example (figure)
      • Query: WorldBank Commodity Prices 2019, type:(Dataset)
      • Web page markup: Web crawl (Common Crawl, 44 bn facts); approx. 125.000 facts for query [ s:Product, “iPhone6” ]
      • Step 1, blocking & coreference resolution, yields the candidate facts:
          node1  name            WB Commodity
          node1  distribution    node_xy
          node1  creator         Worldbank
          node1  dateReleased    26 April 2019
          node2  creator         World Bank
          node2  encodingFormat  text/CSV
          node3  dateCreated     26 April 2007
          node4  keywords        “crude”
      • Step 2, (supervised) fusion / fact selection, yields the entity description:
          name            “WorldBank Commodity Prices 2019”
          distribution    Worldbank (node)
          releaseDate     26.04.2019
          keywords        “crude”, “prices”, “market”
          encodingFormat  text/CSV
      • New queries: WorldBank, type:(Organization); Washington, type:(City); David Malpass, type:(Person)
      References
      • Yu, R., [..], Dietze, S., KnowMore-Knowledge Base Augmentation with Structured Web Markup, Semantic Web Journal 2019 (SWJ2019)
      • Tempelmeier, N., Demidova, E., Dietze, S., Inferring Missing Categorical Information in Noisy and Sparse Web Markup, The Web Conf. 2018 (WWW2018)
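The blocking step (1.a) narrows billions of markup facts down to a few candidate entity descriptions before any supervised step runs. A minimal sketch of BM25 ranking in plain Python follows; the whitespace tokenisation, the parameter defaults `k1=1.2` and `b=0.75`, and the four toy entity descriptions are assumptions for illustration, not the KnowMore implementation.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return (score, doc_index) pairs sorted by descending BM25 score."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n
    # Document frequency: in how many descriptions each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for idx, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append((score, idx))
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical entity descriptions, each flattened into one string.
descriptions = [
    "WB Commodity Worldbank 26 April 2019",
    "World Bank text/CSV",
    "iPhone 6 Apple smartphone",
    "crude commodity prices",
]

ranking = bm25_rank("WorldBank Commodity Prices 2019", descriptions)
# Blocking: keep only descriptions that share at least one query term.
candidates = [idx for score, idx in ranking if score > 0]
print(candidates)
```

Note that the second description ("World Bank") is missed because of the spelling variant; catching such cases is what the subsequent coreference resolution step (1.b) is for.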
  11. KnowMore: data fusion on markup (results; pipeline, worked example and references as on slide 10)
      Fusion performance
      • Experiments on books, movies, products (ongoing: datasets)
      • Baselines: BM25, CBFS [ESWC2015], PreRecCorr [Pochampally et al., ACM SIGMOD 2014]; strong variance across types
      Knowledge graph augmentation
      • On average 60% - 70% of all facts are new (across DBpedia, Wikidata, Freebase)
      • Additional experiments on learning new categorical features (e.g. product categories or movie genres) [WWW2018]
  12. Applications: search for (social sciences) datasets, resources & relations
      https://search.gesis.org/
      • Disambiguation of datasets, methods, software, authors, topics
      [Screenshot labels: Dataset, Rel. Publications]
  13. Rich Context & Coleridge Initiative: building (yet another) KG of scholarly resources & datasets
      • Context/corpus: publications (currently: social sciences, SAGE Publishing)
      • Tasks:
        I. Extraction/disambiguation of dataset mentions
        II. Extraction/detection of research methods
        III. Classification of research fields
      https://coleridgeinitiative.org/richcontextcompetition
  14. Disambiguation of dataset citations
      Example passage: “All these issues are addressed in the current report, which is based on analysis of data obtained in the National Comorbidity Survey (NCS) (15). The NCS is a nationally representative survey of the US household population that includes retrospective reports about the ages at onset and lifetime occurrences of suicidal ideation, plans, and attempts along with information about the occurrences of mental disorders, substance use, substance abuse, and substance dependence.”
      Highlighted dataset mentions: “National Comorbidity Survey (NCS)”, “NCS”
      Challenges
      • Ambiguous (incomplete) citations
      • Lack of high-quality and representative training data (usually: weak labels, domain bias)
      Approaches & results
      • Prior work: supervised pattern induction [Boland et al., TPDL2012]
      • Current approach:
        o Neural NER based on spaCy (CRF-based approach for research method detection)
        o Training (testing) on 12.000 (3.000) paragraphs (distribution of negatives/positives differs; training batch size = 25, dropout = 0.4)
        o Results: approx. P = .50, R = .90 (weakly labelled test data)
        o On a small set of manually labelled test data: P = .52, R = .21
      Reference: Otto, W. et al., Knowledge Extraction from scholarly publications – the GESIS contribution to the Rich Context Competition, to appear, Sage Publishing, 2020
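To compare the two evaluation settings on a single scale, the reported precision and recall can be combined into F1, their harmonic mean. The slide does not report F1 itself, so the figures below are simply derived from the P and R values above; the helper name `f1_score` is an assumption.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# P and R as reported on the slide.
weak_f1 = f1_score(0.50, 0.90)    # weakly labelled test data
manual_f1 = f1_score(0.52, 0.21)  # small manually labelled test set

print(round(weak_f1, 3), round(manual_f1, 3))
```

Precision is nearly identical in both settings, so the gap in F1 is driven almost entirely by the drop in recall on the manually labelled data.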
  15. Profiling (Graph) Datasets
      Motivation
      • Profiling datasets: extracting representative dataset metadata, e.g. to distinguish datasets of different kinds, find/discover datasets, generate synthetic datasets
      • Research question: what are effective graph metrics to profile graph-based research data (social graphs, knowledge graphs)?
      Methods & results
      • Framework for profiling datasets based on 60 different graph metrics
      • Feature engineering (correlation analysis etc.) and feature impact analysis
      • Certain dataset categories are hard to describe/distinguish due to the inherent diversity/variance of the datasets
      • The set of descriptive, non-redundant dataset profile features varies across dataset categories
      [Figures: feature homogeneity (lighter colour = more homogeneous metric within domain); feature impact in binary classification task (RF)]
      Reference: Zloch, M., Acosta, M., Hienert, D., Dietze, S., Conrad, S., A Software Framework and Datasets for the Analysis of Graph Measures on RDF Graphs, ESWC19, Best Student Paper
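A dataset profile of this kind is essentially a vector of graph measures. The sketch below computes a handful of simple ones for a directed graph given as (subject, object) pairs; the selection of metrics and the function name `profile_graph` are assumptions for illustration and cover only a small fraction of the 60 metrics the framework uses.

```python
from collections import Counter


def profile_graph(edges):
    """Return a few structural metrics for a directed graph of (subject, object) pairs."""
    nodes = {node for edge in edges for node in edge}
    out_degree = Counter(s for s, _ in edges)
    in_degree = Counter(o for _, o in edges)
    n, m = len(nodes), len(edges)
    return {
        "nodes": n,
        "edges": m,
        "max_out_degree": max(out_degree.values(), default=0),
        "max_in_degree": max(in_degree.values(), default=0),
        # Each edge contributes one out-degree and one in-degree.
        "mean_degree": 2 * m / n if n else 0.0,
        # Share of possible directed edges (no self-loops) that exist.
        "density": m / (n * (n - 1)) if n > 1 else 0.0,
    }


toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
profile = profile_graph(toy_edges)
print(profile)
```

For an RDF graph, the pairs would be taken from the subject and object of each triple, with the predicate ignored or used to build per-predicate profiles.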
  16. Beyond datasets: linking social sciences survey items
      Motivation
      • Surveys are costly: finding and reusing survey questions/items
      • Linking semantically related questions/responses across survey programmes: e.g. all questions/responses which evaluate the economic situation in Germany at present
      Approach & results
      • Taxonomy of question features & vocabulary for representing survey items & features
      • Initial ML models for predicting item features
        o Multiclass classification models for predicting information types (1st level: 3 classes, 2nd level: 9 classes)
        o LSTM, logistic regression, SVM, Naive Bayes, Random Forest
        o Reasonable performance, LSTM most robust
      Example from ALLBUS 2018, Q: “How would you rate the current economic conditions in Germany?”
      [Figure: taxonomy of question features. Dimensions: Information Type, Focus, Time Reference, Periodicity, Relative Location, Geo. Location. Value labels shown: Family-Member, Fact, Cognition, Self-focus, Evaluation, Object-focus, Past, Present, Future, ..., Apartment, Neighborhood, Country, <Continent>, <Country>, <City>, Point in time, Time span, Periodic, Point...]
      Reference: F. Bensmann, A. Papenmeier, D. Kern, B. Zapilko, S. Dietze, Semantic Annotation, Representation and Linking of Survey Data, in progress
  17. Overview
      • Part I: Retrieving, extracting and linking research data (in particular: metadata) on the Web
      • Part II: Mining novel forms of research data (KGs) from the Web
      [Figure labels: Datasets, Metadata, Publications, Web pages, Opinions, Claims, Stances]
  18. Traditional & novel forms of research data: the case of social sciences
      • Traditional social science research data: survey & census data, microdata, lab studies etc. (lack of scale, dynamics)
      • Social science vision: substituting & complementing traditional research data through data mined from the Web
      • Example: investigations into misinformation and opinion forming on Twitter (e.g. [Vosoughi et al. 2018])
      • Such work usually aims at gaining insights while also dealing with methodological/computational challenges
      • Insights, mostly in (computational) social sciences, e.g.
        o Spreading of claims and misinformation
        o Effect of biased and fake news on public opinions
      • Methods, mostly in computer science, e.g. for
        o Crawling, harvesting, scraping of data
        o Extraction of structured knowledge (sentiments, stances, claims, etc.)
        o Claim/fact detection and verification (“fake news detection”), e.g. CLEF 2018 Fact Checking Lab
        o Stance detection, e.g. Fake News Challenge (FNC)
  19. Mining opinions & interactions (the case of Twitter)
      • Heterogeneity: multimodal, multilingual, informal, “noisy” language
      • Context dependence: interpretation of tweets/posts (entities, sentiments) requires consideration of context (e.g. time, linked content), e.g. “Dusseldorf” => city or football team
      • Dynamics & scale: e.g. 6000 tweets per second, plus interactions (retweets etc.) and context (e.g. 25% of tweets contain URLs)
      • Evolution and temporal aspects: evolution of interactions over time is crucial for many social sciences questions
      • Representativity and bias: demographic distributions not known a priori in archived data collections
      [Figure: annotated example tweet. Labels shown: http://dbpedia.org/resource/Tim_Berners-Lee, http://dbpedia.org/resource/Solid, wna:positive-emotion, wna:negative-emotion, onyx:hasEmotionIntensity "0.75", onyx:hasEmotionIntensity "0.0"]
      Reference: P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze, TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets, ESWC'18.
  20. TweetsKB: a knowledge graph of Web mined “opinions”
      https://data.gesis.org/tweetskb/
      • Harvesting & archiving of 9 bn tweets over 6 years (permanent collection from Twitter 1% sample since 2013)
      • Information extraction pipeline to build a KG of entities, interactions & sentiments (distributed batch processing via Hadoop Map/Reduce)
        o Entity linking with knowledge graph/DBpedia (Yahoo's FEL [Blanco et al. 2015]), e.g. “president”/“potus”/“trump” => dbp:DonaldTrump, to disambiguate text and use background knowledge (e.g. US politicians? Republicans?); high precision (.85), low recall (.39)
        o Sentiment analysis/annotation using SentiStrength [Thelwall et al., 2012], F1 approx. .80
        o Extraction of metadata and lifting into established schemas (SIOC, schema.org), publication using W3C standards (RDF/SPARQL)
      Reference: P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze, TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets, ESWC'18.
  21. TweetsKB: a knowledge graph of Web mined “opinions” (use cases & limitations; collection and pipeline as on slide 20)
      https://data.gesis.org/tweetskb/
      Use cases
      • Aggregating sentiments towards topics/entities, e.g. about CDU vs SPD politicians in a particular time period
      • Twitter archives as general corpus for understanding temporal entity relatedness (e.g. “austerity” & “Greece” 2010-2015)
      • Investigating spreading & impact of fake news (e.g. TweetsKB, ClaimsKG, stance detection)
      Limitations
      • Bias & representativity: demographic distributions of users (not known a priori and not representative)
      [Chart: Cologne vs Düsseldorf, value axis from -0.4 to 0.4]
      Reference: P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze, TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets, ESWC'18.
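The first use case, aggregating sentiments towards entities within a time period, can be sketched in a few lines. The record layout (one dict per entity mention with `positive` and `negative` intensities), the scoring as positive minus negative, and all sample values are assumptions for illustration; they are not taken from TweetsKB.

```python
from collections import defaultdict


def mean_sentiment(annotations, year):
    """Average (positive - negative) intensity per entity for one year."""
    totals = defaultdict(lambda: [0.0, 0])  # entity -> [score sum, mention count]
    for record in annotations:
        if record["year"] != year:
            continue
        acc = totals[record["entity"]]
        acc[0] += record["positive"] - record["negative"]
        acc[1] += 1
    return {entity: total / count for entity, (total, count) in totals.items()}


# Hypothetical annotated entity mentions.
annotations = [
    {"entity": "dbr:Cologne", "year": 2018, "positive": 0.75, "negative": 0.25},
    {"entity": "dbr:Cologne", "year": 2018, "positive": 0.25, "negative": 0.25},
    {"entity": "dbr:Düsseldorf", "year": 2018, "positive": 0.0, "negative": 0.5},
    {"entity": "dbr:Cologne", "year": 2017, "positive": 0.0, "negative": 1.0},
]

scores_2018 = mean_sentiment(annotations, 2018)
print(scores_2018)
```

Against the published knowledge graph, the same aggregation would more naturally be expressed as a SPARQL query with a GROUP BY over the linked entity.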
  22. Mining knowledge about claims and stances
      [Figure labels: stance, claim trustworthiness?]
  23. Detecting stances towards claims/opinions
      Motivation
      • Problem: detecting the stance of documents (e.g. Web pages, scientific publications) towards a given claim (unbalanced class distribution)
      • Motivation: stance of documents (in particular disagreement) is useful (a) as a signal for truthfulness (fake news detection) and (b) for document or source classification (PLDs, publishers)
      Approach
      • Cascading binary classifiers: addressing individual issues (e.g. misclassification costs) per step
      • Features, e.g. textual similarity (Word2Vec etc.), sentiments, LIWC, etc.
      • Best-performing models: 1) SVM with class-wise penalty, 2) CNN, 3) SVM with class-wise penalty
      • Experiments on FNC-1 dataset (and FNC baselines)
      Results
      • Minor overall performance improvement
      • Improvement on the disagree class by 27% (but still far from robust)
      Reference: A. Roy, A. Ekbal, S. Dietze, P. Fafalios, Exploiting stance hierarchies for cost-sensitive stance detection of Web documents, JCDL2020 under review.
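The control flow of a cascade of binary classifiers can be sketched as follows. The split into three steps (related vs unrelated, neutral vs opinionated, agree vs disagree) is an assumption based on the four FNC-1 classes, and the keyword-based `toy_*` functions are hypothetical stand-ins for the trained SVM and CNN models named on the slide.

```python
def cascade_stance(claim, document, is_related, takes_position, agrees):
    """Apply three binary decisions in sequence; each step can use its own model."""
    if not is_related(claim, document):
        return "unrelated"
    if not takes_position(claim, document):
        return "discuss"
    return "agree" if agrees(claim, document) else "disagree"


# Toy stand-ins for trained classifiers (assumptions for illustration only).
def toy_is_related(claim, document):
    shared = set(claim.lower().split()) & set(document.lower().split())
    return len(shared) >= 2


def toy_takes_position(claim, document):
    cues = {"confirms", "true", "false", "hoax", "debunked"}
    return any(word in cues for word in document.lower().split())


def toy_agrees(claim, document):
    negative_cues = {"false", "hoax", "debunked"}
    return not any(word in negative_cues for word in document.lower().split())


claim = "the moon landing was staged"
documents = [
    "weather forecast for tomorrow",
    "a report on the moon landing",
    "the moon landing story was debunked",
    "evidence confirms the moon landing was staged",
]
stances = [
    cascade_stance(claim, doc, toy_is_related, toy_takes_position, toy_agrees)
    for doc in documents
]
print(stances)
```

Because each step is a separate binary problem, class weights or misclassification costs can be tuned per step, which is where the rare disagree class benefits.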
  24. ClaimsKG: a knowledge graph of claims and claim-related metadata
      https://data.gesis.org/claimskg/
      Motivation
      • Claims are spread across various (unstructured) fact-checking sites
      • Example: finding claims about / made by US Republican politicians across the Web?
      Approach
      • Harvesting claims & metadata from fact-checking sites (e.g. snopes.com, Politifact.com etc.); currently approx. 30.000 claims, plus mining of schema.org/ClaimReview markup (> 500.000 statements in Common Crawl 2017)
      • Information extraction & linking
        o Linking mentioned entities to DBpedia
        o Normalisation of ratings (true, false, mixture, other); coreference resolution of claims
        o Exposing data through established vocabulary and W3C standards (e.g. SPARQL endpoint)
      Reference: A. Tchechmedjiev, P. Fafalios, K. Boland, S. Dietze, B. Zapilko, K. Todorov, ClaimsKG – A Live Knowledge Graph of fact-checked Claims, ISWC2019
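Rating normalisation maps each site's own verdict labels onto the four shared categories (true, false, mixture, other). The lookup table below is a hypothetical example of such a mapping, not the table ClaimsKG actually uses; the site labels are common fact-checking verdicts chosen for illustration.

```python
# Hypothetical mapping from site-specific verdicts to normalised ratings.
RATING_MAP = {
    "true": "TRUE",
    "false": "FALSE",
    "pants on fire": "FALSE",
    "half true": "MIXTURE",
    "mostly true": "MIXTURE",
    "mostly false": "MIXTURE",
    "mixture": "MIXTURE",
}


def normalise_rating(label):
    """Map a raw verdict label to TRUE, FALSE, MIXTURE or OTHER."""
    # Lower-case, treat hyphens as spaces, collapse repeated whitespace.
    key = " ".join(label.lower().replace("-", " ").split())
    return RATING_MAP.get(key, "OTHER")


raw_labels = [" Pants on Fire ", "Half-True", "TRUE", "Unproven"]
normalised = [normalise_rating(label) for label in raw_labels]
print(normalised)
```

Anything without an explicit entry falls into OTHER, which keeps the mapping conservative when a site introduces a new verdict label.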
  25. Conclusions & open challenges
      Retrieving, extracting, linking of research dataset metadata (KGs)
      • Mining of unstructured Web pages and scholarly articles for research datasets & metadata
      • Profiling of research datasets for discovery, sampling, generation of synthetic data
      • Plenty of related initiatives and efforts (e.g. Rich Context, Research Graph, OpenAIRE, ORKG)
      • Challenges: transparent/reproducible/reusable methods for extraction & mining across domains and corpora
      Mining and sharing novel forms of research data (KGs)
      • Mining the Web for novel forms of research data
      • Examples from social sciences: opinions (sentiments on entities) and interactions on Twitter & structured knowledge about resource relations (for instance: stances) and claims
      • Challenges: language understanding/interpretation, representativity and bias
  26. Acknowledgements
      • Maribel Acosta (KIT, Karlsruhe)
      • Felix Bensmann (GESIS)
      • Katarina Boland (GESIS, Germany)
      • Stefan Conrad (HHU, Germany)
      • Elena Demidova (L3S, Germany)
      • Dimitar Dimitrov (GESIS, Germany)
      • Asif Ekbal (IIT Patna, India)
      • Pavlos Fafalios (FORTH ICS, Greece)
      • Daniel Hienert (GESIS, Germany)
      • Vasileios Iosifidis (L3S, Germany)
      • Dagmar Kern (GESIS, Germany)
      • Eirini Ntoutsi (LUH, Germany)
      • Wolfgang Otto (GESIS, Germany)
      • Andrea Papenmeier (GESIS, Germany)
      • Markus Rokicki (L3S, Germany)
      • Arjun Roy (IIT Patna, India)
      • Renato Stoffalette Joao (L3S, Germany)
      • Nicolas Tempelmeier (L3S, Germany)
      • Konstantin Todorov (LIRMM, France)
      • Ran Yu (GESIS, Germany)
      • Benjamin Zapilko (GESIS, Germany)
      • Matthäus Zloch (GESIS, Germany)
  27. Stefan Dietze
      • Knowledge Technologies for the Social Sciences (WTS): https://www.gesis.org/en/institute/departments/knowledge-technologies-for-the-social-sciences/
      • WTS Labs: https://www.gesis.org/en/research/applied-computer-science/labs/wts-research-labs
      • Data & Knowledge Engineering @ HHU: https://www.cs.hhu.de/en/research-groups/data-knowledge-engineering.html
      • L3S: http://www.l3s.de
      • Personal: http://stefandietze.net
