SlideShare a Scribd company logo
1 of 155
Download to read offline
University of Sheffield, NLP
Text Analysis with GATE
Diana Maynard
University of Sheffield
Search Solutions Tutorial 2015
London, UK
© The University of Sheffield, 2015
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike Licence
University of Sheffield, NLP
Outline of the Tutorial
• Introduction to NLP and Information Extraction
• Introduction to GATE
• Social media analysis
• Text Analysis for Semantic Search
• Semantic Search
• Semantic Annotation
• GATE MIMIR
• Example applications
• ExoPatent, TNA Web Archive, BBC News, Envilod,
Prospector, Political Futures Tracker
University of Sheffield, NLP
1. NLP and Information Extraction
University of Sheffield, NLP
Oddly enough, people have successfully combined
information and toast...
University of Sheffield, NLP
The weather-forecasting toaster
● This weather-forecasting toaster, connected to a phone point,
was designed in 2001 by a PhD student
● It accessed the MetOffice website via a modem inside the
toaster and translated the information into a 1, 2 or 3 for rain,
cloud or sun
● The relevant symbol was then branded into the toast in the last
few seconds of toasting
University of Sheffield, NLP
However, toast isn't actually a very good medium for finding
information
7
University of Sheffield, NLP
Information overload or filter failure?
● We all know that there's
masses of information
available online these days,
and it's constantly growing
● You often hear people talk
about “information overload”
● But the real problem is not
the amount of information,
but our inability to filter it
correctly
● Clay Shirky has an excellent
talk on this topic
http://bit.ly/oWJTNZ
University of Sheffield, NLP
NLP gives us a way to understand data
● sort the data to remove the rubbish from the interesting parts
● extract the relevant pieces of information
● link the extracted information to other sources of information
(e.g. from DBpedia)
● aggregate the information according to potential new
categories
● query the (aggregated) information
● visualise the results of the query
University of Sheffield, NLP
What is information extraction?
● The automatic discovery of new, previously unknown
information, by automatically extracting information from
different textual resources.
● A key element is the linking together of the extracted
information to form new facts or new hypotheses to be
explored further
● In IE, the goal is to discover previously unknown information,
i.e. something that no one yet knows
10
University of Sheffield, NLP
IE is not Data Mining
Examples:
• using consumer purchasing
patterns to predict which products
to place close together on shelves
in supermarkets
• analysing spending patterns on
credit cards to detect fraudulent
card use.
Data mining is about using analytical techniques to find
interesting patterns from large structured databases
11
University of Sheffield, NLP
IE is not Web Search
● IE is also different from traditional web search or IR.
● In search, the user is typically looking for something that is
already known and has been written by someone else.
● The problem lies in sifting through all the material that currently
isn't relevant to your needs, in order to find the information that
is.
● The solution often lies in better ways to ask the right question
● You can't ask Google to tell you about
– all the speeches made by Tony Blair about foot and mouth
disease while he was Prime Minister
– all the documents in which a politician born in Sheffield is
quoted as saying something about hospitals.
University of Sheffield, NLP
GATE Mímir:
Answering Questions
Google Can't
More about how we can do this will be revealed later...
University of Sheffield, NLP
Information Extraction Basics
● Entity recognition
is required for
● Relation extraction
which is required for
● Event recognition
which is required for
● Summarisation tasks (and other things)
University of Sheffield, NLP
What is Entity Recognition?
●
Entity Recognition is about recognising and classifying key
Named Entities and terms in the text
●
A Named Entity is a Person, Location, Organisation, Date etc.
●
A term is a key concept or phrase that is representative of the text
●
Entities and terms may be described in different ways but refer to
the same thing. We call this co-reference.
Mitt Romney, the favorite to win the Republican nomination for president in 2012
DatePerson Term
The GOP tweeted that they had knocked on 75,000 doors in Ohio the day prior.
Organisation
co-reference
Location
University of Sheffield, NLP
What is Event Recognition?
●
An event is an action or situation relevant to the domain
expressed by some relation between entities or terms.
●
It is always grounded in time, e.g. the performance of a
band, an election, the death of a person
Mitt Romney, the favorite to win the Republican nomination for president in 2012
Event DatePerson
Relation Relation
University of Sheffield, NLP
Why are Entities and Events Useful?
● They can help answer the “Big 5” journalism questions
(who, what, when, where, why)
● They can be used to categorise the texts in different ways
– look at all texts about Obama.
● They can be used as targets for opinion mining
– find out what people think about President Obama
● When linked to an ontology and/or combined with other
information, they can be used for reasoning about things
not explicit in the text
– seeing how opinions about different American
presidents have changed over the years
17
University of Sheffield, NLP
NLP components for text mining
● A text mining system is usually built up from a number of
different NLP components
● First, you need some Information Extraction tools to do the
donkey work of getting all the relevant pieces of information
and facts.
● Then you need some tools to apply the reasoning to the facts,
e.g. opinion mining, information aggregation, semantic
technologies, dynamics analysis
● GATE is an example of a tool for text mining which allows you
to combine all the necessary NLP components
18
University of Sheffield, NLP
Low-level linguistic processing components
●
These are needed to pre-process the text ready for the more
complex IE tasks (NER, relations etc.)
University of Sheffield, NLP
Approaches to Information Extraction
Knowledge Engineering

rule based

developed by
experienced language
engineers

make use of human
intuition

easier to understand
results

development could be
very time consuming

some changes may be
hard to accommodate
Learning Systems

use statistics or other
machine learning

developers do not need
LE expertise

requires large amounts of
annotated training data

some changes may
require re-annotation of
the entire training corpus
University of Sheffield, NLP
2. GATE
University of Sheffield, NLP
What is GATE?
GATE is an NLP toolkit developed at the University of Sheffield
over the last 20 years
It includes:
• components for language processing, e.g. parsers, machine
learning tools, stemmers, IR tools, IE components for various
languages...
• tools for visualising and manipulating text, annotations,
ontologies, parse trees, etc.
• various information extraction tools
• evaluation and benchmarking tools
University of Sheffield, NLP
GATE components
● Language Resources (LRs), e.g. lexicons, corpora,
ontologies
● Processing Resources (PRs), e.g. parsers, generators,
taggers
● Visual Resources (VRs), i.e. visualisation and editing
components
● Algorithms are separated from the data, which means:
– the two can be developed independently by users with
different expertise.
– alternative resources of one type can be used without
affecting the other, e.g. a different visual resource can be
used with the same language resource
University of Sheffield, NLP
ANNIE
• ANNIE is GATE's rule-based IE system
• It uses the language engineering approach (though we also
have tools in GATE for ML)
• Distributed as part of GATE
• Uses a finite-state pattern-action rule language, JAPE
• ANNIE contains a reusable and easily extendable set of
components:
– generic pre-processing components for tokenisation,
sentence splitting etc
– components for performing NE on general open domain text
University of Sheffield, NLP
What's in ANNIE?
• The ANNIE application contains a set of core PRs:
• Tokeniser
• Sentence Splitter
• POS tagger
• Gazetteers
• Named entity tagger (JAPE transducer)
• Orthomatcher (orthographic coreference)
• There are also other PRs available in the ANNIE plugin, which
are not used in the default application, but can be added if
necessary
• NP and VP chunker, morphological analysis
University of Sheffield, NLP
ANNIE Modules
University of Sheffield, NLP
Let's look at the PRs
• Each PR in the ANNIE pipeline creates some new
annotations, or modifies existing ones
• Document Reset → removes annotations
• Tokeniser → Token annotations
• Gazetteer → Lookup annotations
• Sentence Splitter → Sentence, Split annotations
• POS tagger → adds category features to Token
annotations
• NE transducer → Date, Person, Location, Organisation,
Money, Percent annotations
• Orthomatcher → adds match features to NE annotations
University of Sheffield, NLP
Tokeniser

Tokenisation based on Unicode classes

Declarative token specification language

Produces Token and SpaceToken annotations with
features orthography and kind

Length and string features are also produced

Rule for a lowercase word with initial uppercase
letter:
  
"UPPERCASE_LETTER" LOWERCASE_LETTER"* > 
  Token; orthography=upperInitial; kind=word
University of Sheffield, NLP
28
Document with Tokens
University of Sheffield, NLP
ANNIE English Tokeniser

The English Tokeniser is a slightly enhanced version of the
Unicode tokeniser

It comprises an additional JAPE transducer which adapts
the generic tokeniser output for the POS tagger
requirements

It converts constructs involving apostrophes into more
sensible combinations

don’t → do + n't

you've → you + 've
University of Sheffield, NLP
Sentence Splitter
• The default splitter finds sentences based on Tokens
• Creates Sentence annotations and Split annotations on the
sentence delimiters
• Uses a gazetteer of abbreviations etc. and a set of JAPE
grammars which find sentence delimiters and then annotate
sentences and splits
• There are a few sentence splitter variants available which work
in slightly different ways
University of Sheffield, NLP
31
Document with Sentences
University of Sheffield, NLP
POS tagger

ANNIE POS tagger is a Java implementation of Brill's
transformation based tagger

Previously known as Hepple Tagger

Trained on WSJ, uses Penn Treebank tagset

Default ruleset and lexicon can be modified manually (with a
little deciphering)

Adds category feature to Token annotations

Requires Tokeniser and Sentence Splitter to be run first
Morphological analyser

Not an integral part of ANNIE, but can be found in the
Tools plugin as an “added extra”

Flex based rules: can be modified by the user
(instructions in the User Guide)

Generates “root” feature on Token annotations

Requires Tokeniser to be run first

Requires POS tagger to be run first if the
considerPOSTag parameter is set to true
University of Sheffield, NLP
Gazetteers
●
Gazetteers are plain text files containing lists of names (e.g
rivers, cities, people, …)
●
The lists are compiled into Finite State Machines
●
Each gazetteer has an index file listing all the lists, plus
features of each list (majorType, minorType and language)
●
Lists can be modified either internally using the Gazetteer
Editor, or externally in your favourite editor
●
Gazetteers generate Lookup annotations with relevant features
corresponding to the list matched
●
You can also add extra features and values to individual lists
●
Lookup annotations are used primarily by the NE transducer
●
The ANNIE gazetteer has about 60,000 entries arranged in 80
lists
University of Sheffield, NLP
Gazetteer editor
definition file
entries entries for selected list
University of Sheffield, NLP
Named Entity Recognition
• Gazetteers can be used to find terms that suggest entities
• However, the entries can often be ambiguous
– “May Jones” vs “May 2010” vs “May I be excused?”
– “Mr Parkinson” vs “Parkinson's Disease”
– “General Motors” vs. “General Smith”
• Handcrafted grammars are used to define patterns over the
Lookups and other annotations
• These patterns can help disambiguate, and they can combine
different annotations, e.g. Dates can be comprised of day +
number + month
University of Sheffield, NLP
Named Entity Grammars
• Hand-coded rules written in JAPE applied to annotations to
identify NEs
• Phases run sequentially and constitute a cascade of FSTs over
annotations
• Annotations from format analysis, tokeniser. splitter, POS tagger,
morphological analysis, gazetteer etc.
• Because phases are sequential, annotations can be built up over
a period of phases, as new information is gleaned
• Standard named entities: persons, locations, organisations, dates,
addresses, money
• Basic NE grammars can be adapted for new applications, domains
and languages
University of Sheffield, NLP
Example of a JAPE pattern-action rule
Rule: PersonName
(
{Lookup.majorType == firstname}
{Token.category == NNP}
):tag
-->
:tag.Person = {kind = fullName}
Look for an entry in a
gazetter list of first
names
followed by a proper noun
create an annotation of type “Person” give the annotation the feature kind
and value “fullName”
University of Sheffield, NLP
JAPE rules are usually a bit more complex
● They can make use of any annotations already created, e.g.
– strings of text
– POS categories
– gazetteer lookup
– root forms of words
– document metadata, e.g. HTML tags
– annotations “in context”
● They can use regular expression operators, including negative
constraints, “contains”, “within” etc.
● And the RHS of a rule can contain any Java code you like
University of Sheffield, NLP
Document with NEs
Using co-reference

Different expressions may refer to the same entity

Orthographic co-reference module (orthomatcher)
matches proper names and their variants in a
document

[Mr Smith] and [John Smith] will be matched as the
same person

[International Business Machines Ltd.] will match
[IBM]
Orthomatcher PR
• Performs co-reference resolution based on orthographical
information of entities
• Produces a list of annotation IDs that form a co-reference “chain”
• List of such lists stored as a document feature named
“MatchesAnnots”
• Improves results by assigning entity type to previously unclassified
names, based on relations with classified entities
• May not reclassify already classified entities
• Classification of unknown entities very useful for surnames which
match a full name, or abbreviations,
e.g. “Bonfield” <Unknown> will match “Sir Peter Bonfield” <Person>
• A pronominal PR is also available
University of Sheffield, NLP
Co-reference in GATE
University of Sheffield, NLP
Other NLP Toolkits
• UIMA
• OpenCalais
• Lingpipe
• OpenNLP
• Stanford Tools
All integrated into GATE as plugins
University of Sheffield, NLP
3. Analysing Social Media
University of Sheffield, NLP
Analysing language in social media is hard
● Grundman:politics makes #climatechange scientific issue,people
don’t like knowitall rational voice tellin em wat 2do
● @adambation Try reading this article , it looks like it would be
really helpful and not obvious at all. http://t.co/mo3vODoX
● Want to solve the problem of #ClimateChange? Just #vote for a
#politician! Poof! Problem gone! #sarcasm #TVP #99%
● Human Caused #ClimateChange is a Monumental Scam!
http://www.youtube.com/watch?v=LiX792kNQeE … F**k yes!!
Lying to us like MOFO's Tax The Air We Breath! F**k Them!
University of Sheffield, NLP
Challenges for NLP
● Noisy language: unusual punctuation, capitalisation, spelling, use
of slang, sarcasm etc.
● Terse nature of microposts such as tweets
● Use of hashtags, @mentions etc causes problems for
tokenisation #thisistricky
● Lack of context gives rise to ambiguities
● NER performs poorly on microposts, mainly because of linguistic
pre-processing failure
– Performance of standard IE tools decreases from ~90% to
~40% when run on tweets rather than news articles
University of Sheffield, NLP
Persons in news articles
University of Sheffield, NLP
Persons in tweets
University of Sheffield, NLP
Lack of context causes ambiguity
Branching out from Lincoln park after dark ... Hello Russian
Navy, it's like the same thing but with glitter!
??
University of Sheffield, NLP
Getting the NEs right is crucial
Branching out from Lincoln park after dark ... Hello Russian
Navy, it's like the same thing but with glitter!
University of Sheffield, NLP
How do we deal with this kind of text?
● Typical NLP pipeline means that degraded performance has a
knock-on effect along the chain
● Short sentences confuse language identification tools
● Linguistic processing tools have to be adapted to the domain
● Retraining individual components on large volumes of data
● Adaptation of techniques from e.g. SMS analysis
● Development of new Twitter-specific tools (e.g. GATE's TwitIE)
● But....lack of standards, easily accessible data, common
evaluation etc. are holding back development
University of Sheffield, NLP
TwitIE to the rescue
University of Sheffield, NLP
4. Text Analytics for Semantic Search
University of Sheffield, NLP
Text Search isn't Enough
“I like the Internet. Really, I do. Any time I
need a piece of shareware or I want to find
out the weather in Bogota... I'm the first guy
to get the modem humming. But as a
source of information, it sucks. You got a
billion pieces of data, struggling to be heard
and seen and downloaded, and anything I
want to know seems to get trampled
underfoot in the crowd."
Michael Marshall, The Straw Men. HarperCollins
Publishers, 2002.
University of Sheffield, NLP
Semantic Queries in Google
University of Sheffield, NLP
Searching for Things, Not Strings
http://googleblog.blogspot.it/2012/05/introducing-knowledge-graph-things-not.html
• 500 million entities that Google
“knows” about
• Used to provide more accurate
search results
• Summaries of information about
the entity being searched
University of Sheffield, NLP
Facebook Graph Search
University of Sheffield, NLP
Semantic Enrichment
● Textual mentions aren't actually that useful in isolation
– knowing that something is a “Person" isn't very helpful
– knowing which Person the mention refers to can be very useful
● Disambiguating mentions against an ontology provides extra context
● This is where semantic enrichment comes in
● The end product is a set of textual mentions linked to an ontology,
otherwise known as semantic annotations
● Annotations on their own can be useful but they can also
– be used to generate corpus level statistics
– be used for further ontology population
– form the basis of summaries
– be indexed to provide semantic search
University of Sheffield, NLP
Automatic Semantic Enrichment
• Use Text Mining, e.g.
• Information Extraction – recognise names of people,
organisations, locations, dates, references, etc.
• Term recognition – identify domain-specific terms
• Automatically extend article metadata to improve search quality
• Example: using a customised GATE text mining pipeline to enrich
metadata in the Envia environmental science repository
http://www.envia.bl.uk/
University of Sheffield, NLP
61
Semantic Enrichment in Envia
University of Sheffield, NLP
Mining medical records
● Medical records contain a large amount of unstructured text
– letters between hospitals and GPs
– discharge summaries
● These documents might contain information not recorded
elsewhere
– it turns out doctors don't like forms!
– often information-specific fields are ignored, with everything
put in the free text area
University of Sheffield, NLP
Medical Records at SLAM
● NIHR Biomedical Research Centre at the South London and
Maudsley Hospital are using text mining in a number of their
studies
● They have developed applications to extract:
– the results of mental state tests, and the date the test was
administered
– education level (high school, university, etc.)
– smoking status
– medication history
● They have even had promising results predicting suicides
University of Sheffield, NLP
Cancer Research
● Genome Wide Association Studies (GWAS) aim to investigate
genetic variants across the whole genome
– With enough cases and controls, this allows them to state
that a given SNP (Single Nucleotide Polymorphism) is
related to a given disease.
– A single study can be very expensive in both time and
money to collect the required samples.
● Can we reduce the costs by analysing published articles to
generate prior probabilities for each SNP?
University of Sheffield, NLP
Can Semantic Annotation Cure Cancer?
● In conjunction with IARC (International Agency for Research on
Cancer, part of the WHO) we developed a text analysis
approach to mine PubMed
● We showed retrospectively that our approach would have saved
over a year's worth of work and more than 1.5 million Euros
● We completed a new study which found a new cause for oral
cancer
– Oral cancer is rare enough that traditional methods would
have failed to find enough cases to make the study plausible
University of Sheffield, NLP
Government Web Archive
● We developed a semantic annotation application to process
every crawled page in the archive.
● Entities annotated included; people, companies, locations,
government departments, ministerial positions, social
documents, dates, money....
● Where possible, annotations were linked to an ontology which
– was based on DBpedia
– was extended with UK government-specific concepts
– included the modelling of the evolution of government
● Annotations were indexed to allow for complex semantic
querying of the collection
● Demo coming later.....
University of Sheffield, NLP
Political Futures Tracker
Where in the UK did Conservative
MPs tweet more about the economy?
University of Sheffield, NLP
5.1 Semantic Annotation
University of Sheffield, NLP
Why ontologies for semantic search?

Semantic annotation: rather than just annotating the word “Cambridge” as a
location, link it to an ontology instance

Differentiate between Cambridge, UK and Cambridge, Mass.

Semantic search via reasoning

So we can infer that this document mentions a city in Europe.

Ontologies tell us that this particular Cambridge is part of the country called
the UK, which is part of the continent Europe.

Knowledge source

If I want to annotate strikes in baseball reports, the ontology will tell me that
a strike involves a batter who is a person

In the text “BA went on strike”, using the knowledge that BA is a company
and not a person, the IE system can conclude that this is not the kind of
strike it is interested in
University of Sheffield, NLP
70
More semantic search examples
Q: {ScalarValue}{MeasurementUnit} ->
A: “12 cm”, “190 g”, “two hours”
---
Q: {Reference} ->
A: JP-A-60-180889
A: Kalderon et al. (1984) Cell 39:499-509
University of Sheffield, NLP
71
Example Semantic Search Architecture
University of Sheffield, NLP
What is Semantic Annotation?
Annotation:
The process of adding metadata to [parts of] a document.
Semantic Annotation:
Annotation process where [parts of] the annotation schema
(annotation types, annotation features) are ontological objects.
University of Sheffield, NLP
Semantic Annotation
University of Sheffield, NLP
What is an Ontology?

Set of concepts (instances and
classes)

Relationships between them (is-a,
part-of, located-in)

Multiple inheritance

Classes can have more than one
parent

Instances can have more than one
class

Ontologies are graphs, not trees
University of Sheffield, NLP
URIs, Labels, Comments

The names of a classes or instance shown are URIs
(Universal Resource Identifier)

http://dbpedia.org/page/Paris

The linguistic lexicalisation is typically encoded in the label
property, as a string

The comment property is often used
for documentation purposes, similarly
a string

Comments and labels are
annotation properties
University of Sheffield, NLP
Datatype Properties

Datatype properties link individuals to data values

Datatype properties can be of type boolean, date, int,

e.g. a person can have an age property

Available datatypes taken from XMLSchema
http://dbpedia.org/page/Paris
University of Sheffield, NLP
Object Properties

Object properties link instances together

They describe relationships between instances, e.g. people work for
organisations

Domain is the subject of the relation (the thing it applies to)

Range is the object of the relation (the possible”values”)
Similar to domains, multiple alternative ranges can be specified by using a
class description of the owl:unionOf, but this raises the complexity of the
ontology and makes reasoning harder
Person work_for Organisation
Domain Property Range
University of Sheffield, NLP
DBpedia
●
Machine readable knowledge on various entities and topics,
including:
– 410,000 places/locations,
– 310,000 persons
– 140,000 organisations
●
For each entity we have:
– entity name variants (e.g. IBM, Int. Business Machines)
– a textual abstract
– reference(s) to corresponding Wikipedia page(s)
– entity-specific properties (e.g. latitude and longitude for
places)
University of Sheffield, NLP
Example from DBpedia
…
Links to GeoNames
And Freebase
Latitude & Longitude
University of Sheffield, NLP
GeoNames
●
2.8 million populated places
– 5.5 million alternate names
●
Knowledge about NUTS country sub-divisions
– use for enrichment of recognised locations with the implied
higher-level country sub-divisions
●
However, the sheer size of GeoNames creates a lot of ambiguity
during semantic enrichment
●
We use it as an additional knowledge source, but not as a primary
source (DBpedia)
University of Sheffield, NLP
5.2 Semantic Annotation In GATE
(and other tools)
University of Sheffield, NLP
Information Extraction for the Semantic Web
●
Traditional IE is based on a flat structure, e.g. recognising
Person, Location, Organisation, Date, Time etc.
●
For the Semantic Web, we need information in a hierarchical
structure
●
Idea is that we attach semantic metadata to the documents,
pointing to concepts in an ontology
●
Information can be exported as an ontology annotated with
instances, or as text annotated with links to the ontology
University of Sheffield, NLP
John lives in London . He works there for Polar Bear Design .
PERSON LOCATION ORGANISATION
Traditional NE Recognition
University of Sheffield, NLP
John lives in London . He works there for Polar Bear Design .
PER LOC ORG
Co-reference
same_as
University of Sheffield, NLP
John lives in London . He works there for Polar Bear Design .
Relations
PER LOC ORG
live_in
University of Sheffield, NLP
John lives in London . He works there for Polar Bear Design .
Relations (2)
PER LOC ORG
employee_of
University of Sheffield, NLP
John lives in London . He works there for Polar Bear Design .
Relations (3)
PER LOC ORG
based_in
University of Sheffield, NLP
Richer NE Tagging
●
Attachment of instances in
the text to concepts in the
domain ontology
●
Disambiguation of
instances, e.g. Cambridge,
MA vs Cambridge, UK
University of Sheffield, NLP
Ontology-based IE
John lives in London. He works there for Polar Bear Design.
University of Sheffield, NLP
Ontology-based IE (2)
John lives in London. He works there for Polar Bear Design.
University of Sheffield, NLP
Typical Semantic Annotation pipeline
Analyse document structure
Linguistic Pre-processing
Ontology Lookup
Ontology-based IE
Populate ontology (optional)
NE recognition
Corpus
Export as RDF
RDF
University of Sheffield, NLP
OpenCalais
http://viewer.opencalais.com/
Paste text of http://www.membranes.com/
Not easily customised/extended
Domain-specific coverage varies
University of Sheffield, NLP
Zemanta
●
Demo at
http://blog.zemanta.com/de
mo/
●
Paste text from
www.membranes.com
●
The main entity of interest
“Hydranautics” is missed
●
Common problem with
general purpose, open-
domain semantic annotation
tools
●
Best results require
bespoke customisation
http://www.zemanta.com/demo/
University of Sheffield, NLP
Semantic Annotation in GATE
●
Insert GATE screenshot
University of Sheffield, NLP
Automatic Semantic Annotation
●
Locations (linked to DBpedia and GeoNames)
– Annotate the place name itself (e.g. Norwich) with the
corresponding DBpedia and GeoNames URIs
– Also use knowledge of the implied reference to the levels 1, 2,
and 3 sub-divisions from the Nomenclature of Territorial Units for
Statistics (NUTS).
– For Norwich, these are East of England (UKH – level 1), East
Anglia (UKH1 – level 2), and Norfolk (UKH13 – level 3).
– Similarly use knowledge to retrieve nearby places
University of Sheffield, NLP
“South Gloucestershire” Example
University of Sheffield, NLP
Semantic Annotation (2)
●
Organisations (linked to DBpedia)
– Names of companies, government organisations, committees,
agencies, universities, and other organisations
●
Dates
– Absolute (e.g. 31/03/2012) and relative (yesterday)
●
Measurements and Percentages
– e.g. 8,596 km2 , 1 km, one fifth, 10%
University of Sheffield, NLP
The YODIE Service from GATE
http://demos.gate.ac.uk/trendminer/obie/
98
http://demos.gate.c.uk/trendminer/obie/
University of Sheffield, NLP
Recognition of term variants in Decarbonet
University of Sheffield, NLP
ClimaTerm Web service
● http://services.gate.ac.uk/decarbonet/term-recognition/
● Input: a document
● Output: the text as an XML file annotated with term and URI
<Document>
<TermInstance="
http://www.eionet.europa.eu/gemet/concept/1148">cars
</Term> pollute by emitting
<Term Instance="http://reegle.info/glossary/1064">carbon
dioxide</Term>
</Document>
“Cars pollute by emitting carbon
dioxide.”
University of Sheffield, NLP
Term recognition demo
University of Sheffield, NLP
Semantic Search: An Overview
University of Sheffield, NLP
GATE Mímir
• can be used to index and search over text, annotations,
semantic metadata (concepts and instances)
• allows queries that arbitrarily mix full-text, structural, linguistic
and semantic annotations
• is open source
University of Sheffield, NLP
What can GATE Mímir do that Google can't?
Show me:
• all documents mentioning a temperature between 30 and 90
degrees F (expressed in any unit)
• all abstracts written in French on patent documents from the last
6 months which mention any form of the word “transistor” in the
English abstract
• the names of the patent inventors of those abstracts
• all documents mentioning steel industries in the UK, along with
their location
University of Sheffield, NLP
Search News Articles for
Politicians born in Sheffield
http://demos.gate.ac.uk/mimir/gpd/search/gus
University of Sheffield, NLP
Easily Create Your Own
Custom GATE Mímir Interfaces
http://demos.gate.ac.uk/pin/
University of Sheffield, NLP
MIMIR: Searching Text Mining Results
●
Searching and managing text annotations, semantic
information, and full text documents in one search engine
●
Queries over annotation graphs
●
Regular expressions, Kleene operators
●
Designed to be integrated as a web service in custom end-user
systems with bespoke interfaces
●
Demos at http://services.gate.ac.uk/mimir/
University of Sheffield, NLP
Modelling Text Annotations and Additional
Knowledge Sources
University of Sheffield, NLP
MIMIR: the technical bit
●
Uses a combination of an FTS engine (MG4J), a database for
indexing the semantic annotations (H2), and a knowledge repository
(OWLIM/Sesame)
●
Use database to indexing annotation kinds, then the additional
linguistic data for words/annotations are stored in MG4J (POS,
morphological root, etc)
●
Semantic query: against the database/ knowledge base for the
additional knowledge
●
A query: decompose into the respective parts, query the appropriate
component, then merge the results
University of Sheffield, NLP
Clustering and Scalability
●
All searches are local to a document, so we can use clusters of
semantic repositories:
●
Faster indexing (less data in each repository)
●
Faster searching (search space is broken into slices that are
searched in parallel – like Google).
●
Joining results is trivial: union of result sets.
●
Simple scalability: just add more nodes.
●
Search speed stays almost constant while the data increases (each
individual repository has the same amount of data; there are just
more repositories)
University of Sheffield, NLP
Ranking of Results
●
By default, documents are returned in index order
●
The following ranking algorithms are also available (others can be
implemented via plugins if required)
– Count
– Hit length
– TF.IDF
– BM25
• These algorithms treat annotations within the queries in the same way
as words, using index level frequencies etc.
• The ranking algorithm can be changed without re-indexing, allowing for
easy experimentation
University of Sheffield, NLP
Scaling Up
●
We annotated 1.08 million web pages using a GATE language
analysis pipeline.
– Documents crawled using Heritrix10 , with total content size of
57 GiB or 6.6 billions plain text characters.
– The indexing server has 2 Intel Xeon 2.8GHz CPUs 11 GB of
RAM, and runs 64 bit Ubuntu Linux. Indexing process was 94
hours.
●
We also indexed 150 million similar web pages, using two hundred
Amazon EC2 Large Instances running for a week to produce a
federated index
●
Mimir runs on GateCloud.net, so easy to scale up
University of Sheffield, NLP
Search for string “Harriet Harman”
University of Sheffield, NLP
Search for string “Harriet Harman says”
University of Sheffield, NLP
Search with morphological variants:
Harriet Harman root:say
University of Sheffield, NLP
Replace strings with NEs: {Person} root:say
University of Sheffield, NLP
{Person} AND root:say – 11803 hits
University of Sheffield, NLP
{Person} [0..5] root: say – 5495 hits
University of Sheffield, NLP
Patent Annotation: Data Model
119
University of Sheffield, NLP
An Example Text
120
University of Sheffield, NLP
Hands-On with patent data
http://demos.gate.ac.uk/mimir/patents/search/index
Text. Matches plain text.
Example: nanomaterial
Linguistic variations of text
Example: (root:nanomaterial | root:nanoparticle )
Annotation. Matches semantic annotations.
Syntax: {Type feature1=value1 feature2=value2...}
Example: {Abstract lang="DE"}
Sequence Query. Sequence of other queries.
Syntax: Query1 [n..m] Query2...
Example: from {Measurement} [1..5] {Measurement}
University of Sheffield, NLP
Inclusion Queries
IN Query. Hits of one query only if inside another.
Syntax: Query1 IN Query2
Example: (root:nanomaterial | root:nanoparticle ) IN {Abstract}
Number of times these words are mentioned in patent abstracts (as
well as links to the actual documents)
OVER Query. Hits of a query, only if overlapping hits of another.
Syntax: Query1 OVER Query2
Example: {Abstract} OVER (root:nanomatrial | root:nanoparticle )
Finds all abstracts that contain nanomaterial(s) or nanoparticle(s)
University of Sheffield, NLP
Date restrictions
(
{Abstract lang="EN"} OVER
(root:nanomaterial | root:nanoparticle )
)
IN
{PatentDocument date > 20050000}
YYYYMMDD
University of Sheffield, NLP
Find references to literature or patents in
the prior art or background sections, which
contain nanomaterial/nanoparticle
({Reference type="Literature"}
|
{Reference type="Patent"}
) IN
({Section type="PriorArt"}
|
{Section type="BackgroundArt"}
)
OVER
(root:nanomaterial | root:nanoparticle)
University of Sheffield, NLP
Queries Using External Knowledge
{Measurement spec="1 to 100 volts“}
●
Uses GNU Units (http://www.gnu.org/software/units/) to convert
measurements and normalise them to ISI units
{Measurement spec="1 to 100 kg m^2 / A s^3"}
●
Example hits: 10 volts, 2V, +20 and -20 volts; ±10V; +/- 100V; +3.3
volts
{Measurement spec="1 to 100 m / s"}
●
Example hits: 40 km/hr, 60m/min, 100cm/sec, 60 fps; 10 to 2000
cm/sec
University of Sheffield, NLP
Searching LOD with SPARQL
• SQL-like query language for RDF data
• Simple protocol for querying remote databases over HTTP
• Query types
• select: projections of variables and expressions
• construct: create triples (or graphs) based on query results
• ask: whether a query returns results (result is true/false)
• describe: describes resources in the graph
University of Sheffield, NLP
SPARQL Example
# Software companies founded in the US
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX dbp-ont: <http://dbpedia.org/ontology/>
PREFIX geo-ont: <http://www.geonames.org/ontology#>
PREFIX umbel-sc: <http://umbel.org/umbel/sc/>
SELECT DISTINCT ?Company ?Location
WHERE {
?Company rdf:type dbp-ont:Company ;
dbp-ont:industry dbpedia:Computer_software ;
dbp-ont:foundationPlace ?Location .
?Location geo-ont:parentFeature dbpedia:United_States .
}
University of Sheffield, NLP
SPARQL Results
●
Try at: http://factforge.net/sparql
University of Sheffield, NLP
{Person sparql = "SELECT ?inst WHERE { ?inst :birthPlace
<http://dbpedia.org/resource/Sheffield>}"}
Documents mentioning Persons born in Sheffield
The BBC web document does not mention Sheffield at all:
http://www.bbc.co.uk/news/uk-politics-11494915
The relevant text snippet is:
University of Sheffield, NLP
Try these on the BBC News demo:
http://services.gate.ac.uk/mimir/gpd/search/index
● Gordon Brown [0..3] root:say
● {Person} [0..3] root:say
● {Person inst ="http://dbpedia.org/resource/Gordon_Brown"}
[0..3] root:say
● {Person sparql="SELECT ?inst WHERE { ?inst a :Politician }"}
[0..3] root:say
● {Person sparql = "SELECT ?inst WHERE { ?inst :party
<http://dbpedia.org/resource/Labour_Party_%28UK%29> }" }
[0..3] root:say
● {Person sparql = "SELECT ?inst WHERE {
?inst :party <http://dbpedia.org/resource/Labour_Party_
%28UK%29> .
?inst :almaMater
<http://dbpedia.org/resource/University_of_Edinburgh> }"
} [0..3] root:say
University of Sheffield, NLP
User Interfaces for SPARQL-based Semantic
Search
• SPARQL-based semantic searches, tapping into LOD resources are
extremely powerful
• However, impossible to write by the vast majority of users
• User interfaces for SPARQL-based semantic search:
• Faceted searches (see ExoPatent next)
• Form-based searches (see EnviLOD, PIN)
• Text-based searches (natural language interfaces for querying
ontologies), e.g. FREyA
University of Sheffield, NLP
Faceted Search: ExoPatent Example
●
Use semantic information to expose linkages between
documents based on the intersecting relationships between
various sets of data from
●
FDA Orange Book (23,000 patented drugs
●
Unified Medical Language System (UMLS) – database of medical
terms (370,000)
●
Patent bibliographic information
●
Search for diseases, drug names, body parts, references to
literature and other patents, numeric values, ranges
●
Demo uses a small set of patents (40,000)
University of Sheffield, NLP
ExoPatent: Faceted Search
University of Sheffield, NLP
Annotated Patent Document
University of Sheffield, NLP
Find all applicants who filed patents related
to mitochondria, as well as drug names and
active ingredients
University of Sheffield, NLP
Semantic Search over Content and Annotations
http://demos.gate.ac.uk/trendminer/envilod
University of Sheffield, NLP
EnviLOD Semantic Search UI
University of Sheffield, NLP
EnviLOD Semantic Search UI
University of Sheffield, NLP
EnviLOD Semantic Search UI
University of Sheffield, NLP
Example Results
University of Sheffield, NLP
…and the underlying SPARQL Query
{Sem_Location dbpediaSparql="select distinct ?inst
where {
{{ ?inst <http://dbpedia.org/property/north> ?loc} UNION
{ ?inst <http://dbpedia.org/property/east> ?loc } UNION
{ ?inst <http://dbpedia.org/property/west> ?loc } UNION
{ ?inst <http://dbpedia.org/property/south> ?loc } UNION
{ ?inst <http://dbpedia.org/property/northeast> ?loc } UNION
{ ?inst <http://dbpedia.org/property/northwest> ?loc } UNION
{ ?inst <http://dbpedia.org/property/southeast> ?loc } UNION
{ ?inst <http://dbpedia.org/property/southwest> ?loc }
FILTER(REGEX(STR(?loc), "Sheffield", "i"))} }"} AND (root:"flood")
Ongoing work: Use GeoSparql instead and be able to specify distances and
reason with the richer information in GeoNames
University of Sheffield, NLP
Flooding in Oxford
University of Sheffield, NLP
Some Caveats….
●
EnviLOD demonstrator built in a 6 month project
●
VERY limited environmental science content indexed
– Some Defra, Environment Agency, Scottish Government reports
●
PDFs of the articles are not connected to
University of Sheffield, NLP
We don't just have to look at politicians saying and
measuring things
● If we first process the text with other NLP tools such as
sentiment analysis, we can also search for positive or negative
documents
● Or positive/negative comments about certain
people/topics/things/companies
● In the Decarbonet project, we're looking at people's opinions
about topics relating to climate change, e.g. fracking
● We could index on the sentiment annotations too
● Other people are using the combination of opinion mining and
MIMIR to look at e.g. customer feedback about products /
companies on a huge corpus
University of Sheffield, NLP
A positive tweet
University of Sheffield, NLP
A negative tweet
University of Sheffield, NLP
A Sarcastic Tweet
University of Sheffield, NLP
Explicitly Choosing The Search Classes
University of Sheffield, NLP
Choosing A Specific Instance
University of Sheffield, NLP
What diseases are in these documents?
University of Sheffield, NLP
What Pathogens?
University of Sheffield, NLP
Disease vs Disease Co-ocurrences
University of Sheffield, NLP
Diseases vs Pathogens
University of Sheffield, NLP
Summary
● Text mining is a very useful pre-requisite for doing all kinds of
more interesting things, like semantic search
● Semantic web search allows you to do much more interesting
kinds of search than a standard text-based search
● Text mining is hard, so it won't always be correct, however
● This is especially true on lower quality text such as social
media
● On the other hand, social media has some of the most
interesting material to look at
● Still plenty of work to be done, but there are lots of tools that
you can use now and get useful results
● We run annual GATE training courses where you can spend a
whole week learning all this and more!
University of Sheffield, NLP
Acknowledgements and further information
● Research partially supported by the European Union/EU under the Information
and Communication Technologies (ICT) theme of the 7th Framework Programme
for R&D (FP7) DecarboNet (610829) http://www.decarbonet.eu and TrendMiner
(287863), Nesta
● PFT: https://gate.ac.uk/projects/pft/
● Semantic Search over Documents and Ontologies (2014) K Bontcheva, V Tablan,
H Cunningham. Bridging Between Information Retrieval and Databases, 31-53
● V. Tablan, K. Bontcheva, I. Roberts, and H. Cunningham. Mímir: An open-source
semantic search framework for interactive information seeking and discovery.
Journal of Web Semantics: Science, Services and Agents on the World Wide
Web, 2014
● These and other relevant papers on the GATE website:
http://gate.ac.uk/gate/doc/papers.html
● Annual GATE training course in June in Sheffield:
https://gate.ac.uk/family/training.html
● Download gate: http://gate.ac.uk/download

More Related Content

What's hot

Knowledge, Graphs & 3D CAD Systems - David Bigelow @ GraphConnect Chicago 2013
Knowledge, Graphs & 3D CAD Systems - David Bigelow @ GraphConnect Chicago 2013Knowledge, Graphs & 3D CAD Systems - David Bigelow @ GraphConnect Chicago 2013
Knowledge, Graphs & 3D CAD Systems - David Bigelow @ GraphConnect Chicago 2013
Neo4j
 

What's hot (20)

Traversing Graphs with Gremlin
Traversing Graphs with GremlinTraversing Graphs with Gremlin
Traversing Graphs with Gremlin
 
Drug and Vaccine Discovery: Knowledge Graph + Apache Spark
Drug and Vaccine Discovery: Knowledge Graph + Apache SparkDrug and Vaccine Discovery: Knowledge Graph + Apache Spark
Drug and Vaccine Discovery: Knowledge Graph + Apache Spark
 
Knowledge, Graphs & 3D CAD Systems - David Bigelow @ GraphConnect Chicago 2013
Knowledge, Graphs & 3D CAD Systems - David Bigelow @ GraphConnect Chicago 2013Knowledge, Graphs & 3D CAD Systems - David Bigelow @ GraphConnect Chicago 2013
Knowledge, Graphs & 3D CAD Systems - David Bigelow @ GraphConnect Chicago 2013
 
Försäkringskassan: Neo4j as an Information Hub (GraphSummit Stockholm 2023)
Försäkringskassan: Neo4j as an Information Hub (GraphSummit Stockholm 2023)Försäkringskassan: Neo4j as an Information Hub (GraphSummit Stockholm 2023)
Försäkringskassan: Neo4j as an Information Hub (GraphSummit Stockholm 2023)
 
Knowledge Graphs are Worthless, Knowledge Graph Use Cases are Priceless
Knowledge Graphs are Worthless, Knowledge Graph Use Cases are PricelessKnowledge Graphs are Worthless, Knowledge Graph Use Cases are Priceless
Knowledge Graphs are Worthless, Knowledge Graph Use Cases are Priceless
 
Data science Big Data
Data science Big DataData science Big Data
Data science Big Data
 
Explainable AI (XAI)
Explainable AI (XAI)Explainable AI (XAI)
Explainable AI (XAI)
 
The Art of Discovering Bounded Contexts
The Art of Discovering Bounded ContextsThe Art of Discovering Bounded Contexts
The Art of Discovering Bounded Contexts
 
Explainable AI (XAI) - A Perspective
Explainable AI (XAI) - A Perspective Explainable AI (XAI) - A Perspective
Explainable AI (XAI) - A Perspective
 
Visualisation & Storytelling in Data Science & Analytics
Visualisation & Storytelling in Data Science & AnalyticsVisualisation & Storytelling in Data Science & Analytics
Visualisation & Storytelling in Data Science & Analytics
 
Josh Cavalier - ChatGPT Prompt Strategies.pdf
Josh Cavalier - ChatGPT Prompt Strategies.pdfJosh Cavalier - ChatGPT Prompt Strategies.pdf
Josh Cavalier - ChatGPT Prompt Strategies.pdf
 
Combining a Knowledge Graph and Graph Algorithms to Find Hidden Skills at NASA
Combining a Knowledge Graph and Graph Algorithms to Find Hidden Skills at NASACombining a Knowledge Graph and Graph Algorithms to Find Hidden Skills at NASA
Combining a Knowledge Graph and Graph Algorithms to Find Hidden Skills at NASA
 
CFPB Consumer Complaints Report - Tableau
CFPB Consumer Complaints Report - TableauCFPB Consumer Complaints Report - Tableau
CFPB Consumer Complaints Report - Tableau
 
Hegazi_ChatGPT_Book.pdf
Hegazi_ChatGPT_Book.pdfHegazi_ChatGPT_Book.pdf
Hegazi_ChatGPT_Book.pdf
 
Understanding LLMOps-Large Language Model Operations
Understanding LLMOps-Large Language Model OperationsUnderstanding LLMOps-Large Language Model Operations
Understanding LLMOps-Large Language Model Operations
 
The three layers of a knowledge graph and what it means for authoring, storag...
The three layers of a knowledge graph and what it means for authoring, storag...The three layers of a knowledge graph and what it means for authoring, storag...
The three layers of a knowledge graph and what it means for authoring, storag...
 
Knowledge Graphs and Graph Data Science: More Context, Better Predictions (Ne...
Knowledge Graphs and Graph Data Science: More Context, Better Predictions (Ne...Knowledge Graphs and Graph Data Science: More Context, Better Predictions (Ne...
Knowledge Graphs and Graph Data Science: More Context, Better Predictions (Ne...
 
What_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdfWhat_do_Knowledge_Graph_Embeddings_Learn.pdf
What_do_Knowledge_Graph_Embeddings_Learn.pdf
 
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...Knowledge Graph Generation  from Wikipedia in the Age of ChatGPT:  Knowledge ...
Knowledge Graph Generation from Wikipedia in the Age of ChatGPT: Knowledge ...
 
Learn Prompting with ChatGPT
Learn Prompting with ChatGPTLearn Prompting with ChatGPT
Learn Prompting with ChatGPT
 

Viewers also liked

Semantic Search Engines
Semantic Search EnginesSemantic Search Engines
Semantic Search Engines
Atul Shridhar
 
Intriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platformIntriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platform
toncho11
 

Viewers also liked (20)

Text analysis and Semantic Search with GATE
Text analysis and Semantic Search with GATEText analysis and Semantic Search with GATE
Text analysis and Semantic Search with GATE
 
GATE: a text analysis tool for social media
GATE: a text analysis tool for social mediaGATE: a text analysis tool for social media
GATE: a text analysis tool for social media
 
Tools for (Almost) Real-Time Social Media Analysis
Tools for (Almost) Real-Time Social Media AnalysisTools for (Almost) Real-Time Social Media Analysis
Tools for (Almost) Real-Time Social Media Analysis
 
Semantic web
Semantic webSemantic web
Semantic web
 
sent_analysis_report
sent_analysis_reportsent_analysis_report
sent_analysis_report
 
Who cares about sarcastic tweets? Investigating the impact of sarcasm on sent...
Who cares about sarcastic tweets? Investigating the impact of sarcasm on sent...Who cares about sarcastic tweets? Investigating the impact of sarcasm on sent...
Who cares about sarcastic tweets? Investigating the impact of sarcasm on sent...
 
NLP Asignment Final Presentation [IIT-Bombay]
NLP Asignment Final Presentation [IIT-Bombay]NLP Asignment Final Presentation [IIT-Bombay]
NLP Asignment Final Presentation [IIT-Bombay]
 
Semantic Search Engines
Semantic Search EnginesSemantic Search Engines
Semantic Search Engines
 
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
TwitIE: An Open-Source Information Extraction Pipeline for Microblog TextTwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
 
Adding Semantic Edge to Your Content – From Authoring to Delivery
Adding Semantic Edge to Your Content – From Authoring to DeliveryAdding Semantic Edge to Your Content – From Authoring to Delivery
Adding Semantic Edge to Your Content – From Authoring to Delivery
 
Ontological approach for improving semantic web search results
Ontological approach for improving semantic web search resultsOntological approach for improving semantic web search results
Ontological approach for improving semantic web search results
 
Intriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platformIntriduction to Ontotext's KIM platform
Intriduction to Ontotext's KIM platform
 
Sarcasm & Thwarting in Sentiment Analysis [IIT-Bombay]
Sarcasm & Thwarting in Sentiment Analysis [IIT-Bombay]Sarcasm & Thwarting in Sentiment Analysis [IIT-Bombay]
Sarcasm & Thwarting in Sentiment Analysis [IIT-Bombay]
 
A Taxonomy of Semantic Web data Retrieval Techniques
A Taxonomy of Semantic Web data Retrieval TechniquesA Taxonomy of Semantic Web data Retrieval Techniques
A Taxonomy of Semantic Web data Retrieval Techniques
 
Python NLTK
Python NLTKPython NLTK
Python NLTK
 
In Search of a Semantic Book Search Engine: Are We There Yet?
In Search of a Semantic Book Search Engine: Are We There Yet?In Search of a Semantic Book Search Engine: Are We There Yet?
In Search of a Semantic Book Search Engine: Are We There Yet?
 
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information RetrievalKeystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
 
Semantics And Search
Semantics And SearchSemantics And Search
Semantics And Search
 
Semantic data mining: an ontology based approach
Semantic data mining: an ontology based approachSemantic data mining: an ontology based approach
Semantic data mining: an ontology based approach
 
Sentiment Analysis
Sentiment AnalysisSentiment Analysis
Sentiment Analysis
 

Similar to Text Analysis and Semantic Search with GATE

Data+Science+in+Python+-+Data+Prep+&+EDA.pdf
Data+Science+in+Python+-+Data+Prep+&+EDA.pdfData+Science+in+Python+-+Data+Prep+&+EDA.pdf
Data+Science+in+Python+-+Data+Prep+&+EDA.pdf
neelakandan2001kpm
 

Similar to Text Analysis and Semantic Search with GATE (20)

Text analysis-semantic-search
Text analysis-semantic-searchText analysis-semantic-search
Text analysis-semantic-search
 
Arcomem training entities-and-events_advanced
Arcomem training entities-and-events_advancedArcomem training entities-and-events_advanced
Arcomem training entities-and-events_advanced
 
Entity Search Engine
Entity Search Engine Entity Search Engine
Entity Search Engine
 
Veda Semantics - introduction document
Veda Semantics - introduction documentVeda Semantics - introduction document
Veda Semantics - introduction document
 
Arcomem training opinions_advanced
Arcomem training opinions_advancedArcomem training opinions_advanced
Arcomem training opinions_advanced
 
Natural Language Processing: L01 introduction
Natural Language Processing: L01 introductionNatural Language Processing: L01 introduction
Natural Language Processing: L01 introduction
 
data science and business analytics
data science and business analyticsdata science and business analytics
data science and business analytics
 
ChatGPT in academic settings H2.de
ChatGPT in academic settings H2.deChatGPT in academic settings H2.de
ChatGPT in academic settings H2.de
 
Natural Language Processing Use Cases for Business Optimization
Natural Language Processing Use Cases for Business OptimizationNatural Language Processing Use Cases for Business Optimization
Natural Language Processing Use Cases for Business Optimization
 
What can Natural Language Processing do for you?
What can Natural Language Processing do for you?What can Natural Language Processing do for you?
What can Natural Language Processing do for you?
 
Data Science Unit1 AMET.pdf
Data Science Unit1 AMET.pdfData Science Unit1 AMET.pdf
Data Science Unit1 AMET.pdf
 
Lunch and Learn Artificial intelligence
Lunch and Learn Artificial intelligence Lunch and Learn Artificial intelligence
Lunch and Learn Artificial intelligence
 
Copy of OTel Me All About OpenTelemetry The Current & Future State, Navigatin...
Copy of OTel Me All About OpenTelemetry The Current & Future State, Navigatin...Copy of OTel Me All About OpenTelemetry The Current & Future State, Navigatin...
Copy of OTel Me All About OpenTelemetry The Current & Future State, Navigatin...
 
Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...
 
Natural Language Processing .pdf
Natural Language Processing .pdfNatural Language Processing .pdf
Natural Language Processing .pdf
 
AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位
AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位
AI生成工具的新衝擊 - MS Bing & Google Bard 能否挑戰ChatGPT-4領導地位
 
Data+Science+in+Python+-+Data+Prep+&+EDA.pdf
Data+Science+in+Python+-+Data+Prep+&+EDA.pdfData+Science+in+Python+-+Data+Prep+&+EDA.pdf
Data+Science+in+Python+-+Data+Prep+&+EDA.pdf
 
Machine Learning Product Managers Meetup Event
Machine Learning Product Managers Meetup EventMachine Learning Product Managers Meetup Event
Machine Learning Product Managers Meetup Event
 
Data Science - Experiments
Data Science - ExperimentsData Science - Experiments
Data Science - Experiments
 
Natural language processing and search
Natural language processing and searchNatural language processing and search
Natural language processing and search
 

More from Diana Maynard

Can Social Media Analysis Improve Collective Awareness of Climate Change?
Can Social Media Analysis Improve Collective Awareness of Climate Change?Can Social Media Analysis Improve Collective Awareness of Climate Change?
Can Social Media Analysis Improve Collective Awareness of Climate Change?
Diana Maynard
 

More from Diana Maynard (20)

Filth and lies: analysing social media
Filth and lies: analysing social mediaFilth and lies: analysing social media
Filth and lies: analysing social media
 
Adding value to NLP: a little semantics goes a long way
Adding value to NLP: a little semantics goes a long wayAdding value to NLP: a little semantics goes a long way
Adding value to NLP: a little semantics goes a long way
 
Methodological possibilities for strengthening the monitoring of SDG indicato...
Methodological possibilities for strengthening the monitoring of SDG indicato...Methodological possibilities for strengthening the monitoring of SDG indicato...
Methodological possibilities for strengthening the monitoring of SDG indicato...
 
Getting the-most-out-of-conferences
Getting the-most-out-of-conferencesGetting the-most-out-of-conferences
Getting the-most-out-of-conferences
 
Understanding the world with NLP: interactions between society, behaviour and...
Understanding the world with NLP: interactions between society, behaviour and...Understanding the world with NLP: interactions between society, behaviour and...
Understanding the world with NLP: interactions between society, behaviour and...
 
The language of social media
The language of social mediaThe language of social media
The language of social media
 
Using language to save the world: interactions between society, behaviour and...
Using language to save the world: interactions between society, behaviour and...Using language to save the world: interactions between society, behaviour and...
Using language to save the world: interactions between society, behaviour and...
 
Adapting NLP tools to diverse data: challenges and solutions
Adapting NLP tools to diverse data: challenges and solutionsAdapting NLP tools to diverse data: challenges and solutions
Adapting NLP tools to diverse data: challenges and solutions
 
Ontologies as bridges between data sources and user queries: the KNOWMAK proj...
Ontologies as bridges between data sources and user queries: the KNOWMAK proj...Ontologies as bridges between data sources and user queries: the KNOWMAK proj...
Ontologies as bridges between data sources and user queries: the KNOWMAK proj...
 
20 Years of Text Mining Applications with GATE: from Donald Trump to curing c...
20 Years of Text Mining Applications with GATE: from Donald Trump to curing c...20 Years of Text Mining Applications with GATE: from Donald Trump to curing c...
20 Years of Text Mining Applications with GATE: from Donald Trump to curing c...
 
Challenges of social media analysis in the real world
Challenges of social media analysis in the real worldChallenges of social media analysis in the real world
Challenges of social media analysis in the real world
 
Social media analytics as a service: tools from GATE
Social media analytics as a service: tools from GATESocial media analytics as a service: tools from GATE
Social media analytics as a service: tools from GATE
 
Cls8 decarbonet
Cls8 decarbonetCls8 decarbonet
Cls8 decarbonet
 
Can Social Media Analysis Improve Collective Awareness of Climate Change?
Can Social Media Analysis Improve Collective Awareness of Climate Change?Can Social Media Analysis Improve Collective Awareness of Climate Change?
Can Social Media Analysis Improve Collective Awareness of Climate Change?
 
Can Social Media Analysis Improve Collective Awareness of Climate Change?
Can Social Media Analysis Improve Collective Awareness of Climate Change?Can Social Media Analysis Improve Collective Awareness of Climate Change?
Can Social Media Analysis Improve Collective Awareness of Climate Change?
 
Do we really know what people mean when they tweet?
Do we really know what people mean when they tweet?Do we really know what people mean when they tweet?
Do we really know what people mean when they tweet?
 
Disability and Adventure Travel: the Double-Edged Sword
Disability and Adventure Travel: the Double-Edged SwordDisability and Adventure Travel: the Double-Edged Sword
Disability and Adventure Travel: the Double-Edged Sword
 
Multimodal opinion mining from social media
Multimodal opinion mining from social mediaMultimodal opinion mining from social media
Multimodal opinion mining from social media
 
Practical Opinion Mining for Social Media
Practical Opinion Mining for Social MediaPractical Opinion Mining for Social Media
Practical Opinion Mining for Social Media
 
What do you really mean when you tweet? Challenges for opinion mining on soci...
What do you really mean when you tweet? Challenges for opinion mining on soci...What do you really mean when you tweet? Challenges for opinion mining on soci...
What do you really mean when you tweet? Challenges for opinion mining on soci...
 

Recently uploaded

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Recently uploaded (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

Text Analysis and Semantic Search with GATE

  • 1. University of Sheffield, NLP Text Analysis with GATE Diana Maynard University of Sheffield Search Solutions Tutorial 2015 London, UK © The University of Sheffield, 2015 This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike Licence
  • 2. University of Sheffield, NLP Outline of the Tutorial • Introduction to NLP and Information Extraction • Introduction to GATE • Social media analysis • Text Analysis for Semantic Search • Semantic Search • Semantic Annotation • GATE MIMIR • Example applications • ExoPatent, TNA Web Archive, BBC News, Envilod, Prospector, Political Futures Tracker
  • 3. University of Sheffield, NLP 1. NLP and Information Extraction
  • 4. University of Sheffield, NLP Oddly enough, people have successfully combined information and toast...
  • 5. University of Sheffield, NLP The weather-forecasting toaster ● This weather-forecasting toaster, connected to a phone point, was designed in 2001 by a PhD student ● It accessed the MetOffice website via a modem inside the toaster and translated the information into a 1, 2 or 3 for rain, cloud or sun ● The relevant symbol was then branded into the toast in the last few seconds of toasting
  • 6. University of Sheffield, NLP However, toast isn't actually a very good medium for finding information
  • 7. 7 University of Sheffield, NLP Information overload or filter failure? ● We all know that there's masses of information available online these days, and it's constantly growing ● You often hear people talk about “information overload” ● But the real problem is not the amount of information, but our inability to filter it correctly ● Clay Shirky has an excellent talk on this topic http://bit.ly/oWJTNZ
  • 8. University of Sheffield, NLP NLP gives us a way to understand data ● sort the data to remove the rubbish from the interesting parts ● extract the relevant pieces of information ● link the extracted information to other sources of information (e.g. from DBpedia) ● aggregate the information according to potential new categories ● query the (aggregated) information ● visualise the results of the query
  • 9. University of Sheffield, NLP What is information extraction? ● The automatic discovery of new, previously unknown information, by automatically extracting information from different textual resources. ● A key element is the linking together of the extracted information to form new facts or new hypotheses to be explored further ● In IE, the goal is to discover previously unknown information, i.e. something that no one yet knows
  • 10. 10 University of Sheffield, NLP IE is not Data Mining Examples: • using consumer purchasing patterns to predict which products to place close together on shelves in supermarkets • analysing spending patterns on credit cards to detect fraudulent card use. Data mining is about using analytical techniques to find interesting patterns from large structured databases
  • 11. 11 University of Sheffield, NLP IE is not Web Search ● IE is also different from traditional web search or IR. ● In search, the user is typically looking for something that is already known and has been written by someone else. ● The problem lies in sifting through all the material that currently isn't relevant to your needs, in order to find the information that is. ● The solution often lies in better ways to ask the right question ● You can't ask Google to tell you about – all the speeches made by Tony Blair about foot and mouth disease while he was Prime Minister – all the documents in which a politician born in Sheffield is quoted as saying something about hospitals.
  • 12. University of Sheffield, NLP GATE Mímir: Answering Questions Google Can't More about how we can do this will be revealed later...
  • 13. University of Sheffield, NLP Information Extraction Basics ● Entity recognition is required for ● Relation extraction which is required for ● Event recognition which is required for ● Summarisation tasks (and other things)
  • 14. University of Sheffield, NLP What is Entity Recognition? ● Entity Recognition is about recognising and classifying key Named Entities and terms in the text ● A Named Entity is a Person, Location, Organisation, Date etc. ● A term is a key concept or phrase that is representative of the text ● Entities and terms may be described in different ways but refer to the same thing. We call this co-reference. Mitt Romney, the favorite to win the Republican nomination for president in 2012 DatePerson Term The GOP tweeted that they had knocked on 75,000 doors in Ohio the day prior. Organisation co-reference Location
  • 15. University of Sheffield, NLP What is Event Recognition? ● An event is an action or situation relevant to the domain expressed by some relation between entities or terms. ● It is always grounded in time, e.g. the performance of a band, an election, the death of a person Mitt Romney, the favorite to win the Republican nomination for president in 2012 Event DatePerson Relation Relation
  • 16. University of Sheffield, NLP Why are Entities and Events Useful? ● They can help answer the “Big 5” journalism questions (who, what, when, where, why) ● They can be used to categorise the texts in different ways – look at all texts about Obama. ● They can be used as targets for opinion mining – find out what people think about President Obama ● When linked to an ontology and/or combined with other information, they can be used for reasoning about things not explicit in the text – seeing how opinions about different American presidents have changed over the years
  • 17. 17 University of Sheffield, NLP NLP components for text mining ● A text mining system is usually built up from a number of different NLP components ● First, you need some Information Extraction tools to do the donkey work of getting all the relevant pieces of information and facts. ● Then you need some tools to apply the reasoning to the facts, e.g. opinion mining, information aggregation, semantic technologies, dynamics analysis ● GATE is an example of a tool for text mining which allows you to combine all the necessary NLP components
  • 18. 18 University of Sheffield, NLP Low-level linguistic processing components ● These are needed to pre-process the text ready for the more complex IE tasks (NER, relations etc.)
  • 19. University of Sheffield, NLP Approaches to Information Extraction Knowledge Engineering  rule based  developed by experienced language engineers  make use of human intuition  easier to understand results  development could be very time consuming  some changes may be hard to accommodate Learning Systems  use statistics or other machine learning  developers do not need LE expertise  requires large amounts of annotated training data  some changes may require re-annotation of the entire training corpus
  • 21. University of Sheffield, NLP What is GATE? GATE is an NLP toolkit developed at the University of Sheffield over the last 20 years It includes: • components for language processing, e.g. parsers, machine learning tools, stemmers, IR tools, IE components for various languages... • tools for visualising and manipulating text, annotations, ontologies, parse trees, etc. • various information extraction tools • evaluation and benchmarking tools
  • 22. University of Sheffield, NLP GATE components ● Language Resources (LRs), e.g. lexicons, corpora, ontologies ● Processing Resources (PRs), e.g. parsers, generators, taggers ● Visual Resources (VRs), i.e. visualisation and editing components ● Algorithms are separated from the data, which means: – the two can be developed independently by users with different expertise. – alternative resources of one type can be used without affecting the other, e.g. a different visual resource can be used with the same language resource
  • 23. University of Sheffield, NLP ANNIE • ANNIE is GATE's rule-based IE system • It uses the language engineering approach (though we also have tools in GATE for ML) • Distributed as part of GATE • Uses a finite-state pattern-action rule language, JAPE • ANNIE contains a reusable and easily extendable set of components: – generic pre-processing components for tokenisation, sentence splitting etc – components for performing NE on general open domain text
  • 24. University of Sheffield, NLP What's in ANNIE? • The ANNIE application contains a set of core PRs: • Tokeniser • Sentence Splitter • POS tagger • Gazetteers • Named entity tagger (JAPE transducer) • Orthomatcher (orthographic coreference) • There are also other PRs available in the ANNIE plugin, which are not used in the default application, but can be added if necessary • NP and VP chunker, morphological analysis
  • 25. University of Sheffield, NLP ANNIE Modules
  • 26. University of Sheffield, NLP Let's look at the PRs • Each PR in the ANNIE pipeline creates some new annotations, or modifies existing ones • Document Reset → removes annotations • Tokeniser → Token annotations • Gazetteer → Lookup annotations • Sentence Splitter → Sentence, Split annotations • POS tagger → adds category features to Token annotations • NE transducer → Date, Person, Location, Organisation, Money, Percent annotations • Orthomatcher → adds match features to NE annotations
  • 27. University of Sheffield, NLP Tokeniser  Tokenisation based on Unicode classes  Declarative token specification language  Produces Token and SpaceToken annotations with features orthography and kind  Length and string features are also produced  Rule for a lowercase word with initial uppercase letter:    "UPPERCASE_LETTER" LOWERCASE_LETTER"* >    Token; orthography=upperInitial; kind=word
  • 28. University of Sheffield, NLP 28 Document with Tokens
  • 29. University of Sheffield, NLP ANNIE English Tokeniser  The English Tokeniser is a slightly enhanced version of the Unicode tokeniser  It comprises an additional JAPE transducer which adapts the generic tokeniser output for the POS tagger requirements  It converts constructs involving apostrophes into more sensible combinations  don’t → do + n't  you've → you + 've
  • 30. University of Sheffield, NLP Sentence Splitter • The default splitter finds sentences based on Tokens • Creates Sentence annotations and Split annotations on the sentence delimiters • Uses a gazetteer of abbreviations etc. and a set of JAPE grammars which find sentence delimiters and then annotate sentences and splits • There are a few sentence splitter variants available which work in slightly different ways
  • 31. University of Sheffield, NLP 31 Document with Sentences
  • 32. University of Sheffield, NLP POS tagger  ANNIE POS tagger is a Java implementation of Brill's transformation based tagger  Previously known as Hepple Tagger  Trained on WSJ, uses Penn Treebank tagset  Default ruleset and lexicon can be modified manually (with a little deciphering)  Adds category feature to Token annotations  Requires Tokeniser and Sentence Splitter to be run first
  • 33. Morphological analyser  Not an integral part of ANNIE, but can be found in the Tools plugin as an “added extra”  Flex based rules: can be modified by the user (instructions in the User Guide)  Generates “root” feature on Token annotations  Requires Tokeniser to be run first  Requires POS tagger to be run first if the considerPOSTag parameter is set to true
  • 34. University of Sheffield, NLP Gazetteers ● Gazetteers are plain text files containing lists of names (e.g rivers, cities, people, …) ● The lists are compiled into Finite State Machines ● Each gazetteer has an index file listing all the lists, plus features of each list (majorType, minorType and language) ● Lists can be modified either internally using the Gazetteer Editor, or externally in your favourite editor ● Gazetteers generate Lookup annotations with relevant features corresponding to the list matched ● You can also add extra features and values to individual lists ● Lookup annotations are used primarily by the NE transducer ● The ANNIE gazetteer has about 60,000 entries arranged in 80 lists
  • 35. University of Sheffield, NLP Gazetteer editor definition file entries entries for selected list
  • 36. University of Sheffield, NLP Named Entity Recognition • Gazetteers can be used to find terms that suggest entities • However, the entries can often be ambiguous – “May Jones” vs “May 2010” vs “May I be excused?” – “Mr Parkinson” vs “Parkinson's Disease” – “General Motors” vs. “General Smith” • Handcrafted grammars are used to define patterns over the Lookups and other annotations • These patterns can help disambiguate, and they can combine different annotations, e.g. Dates can be comprised of day + number + month
  • 37. University of Sheffield, NLP Named Entity Grammars • Hand-coded rules written in JAPE applied to annotations to identify NEs • Phases run sequentially and constitute a cascade of FSTs over annotations • Annotations from format analysis, tokeniser. splitter, POS tagger, morphological analysis, gazetteer etc. • Because phases are sequential, annotations can be built up over a period of phases, as new information is gleaned • Standard named entities: persons, locations, organisations, dates, addresses, money • Basic NE grammars can be adapted for new applications, domains and languages
  • 38. University of Sheffield, NLP Example of a JAPE pattern-action rule Rule: PersonName ( {Lookup.majorType == firstname} {Token.category == NNP} ):tag --> :tag.Person = {kind = fullName} Look for an entry in a gazetter list of first names followed by a proper noun create an annotation of type “Person” give the annotation the feature kind and value “fullName”
  • 39. University of Sheffield, NLP JAPE rules are usually a bit more complex ● They can make use of any annotations already created, e.g. – strings of text – POS categories – gazetteer lookup – root forms of words – document metadata, e.g. HTML tags – annotations “in context” ● They can use regular expression operators, including negative constraints, “contains”, “within” etc. ● And the RHS of a rule can contain any Java code you like
  • 40. University of Sheffield, NLP Document with NEs
  • 41. Using co-reference  Different expressions may refer to the same entity  Orthographic co-reference module (orthomatcher) matches proper names and their variants in a document  [Mr Smith] and [John Smith] will be matched as the same person  [International Business Machines Ltd.] will match [IBM]
  • 42. Orthomatcher PR • Performs co-reference resolution based on orthographical information of entities • Produces a list of annotation IDs that form a co-reference “chain” • List of such lists stored as a document feature named “MatchesAnnots” • Improves results by assigning entity type to previously unclassified names, based on relations with classified entities • May not reclassify already classified entities • Classification of unknown entities very useful for surnames which match a full name, or abbreviations, e.g. “Bonfield” <Unknown> will match “Sir Peter Bonfield” <Person> • A pronominal PR is also available
  • 43. University of Sheffield, NLP Co-reference in GATE
  • 44. University of Sheffield, NLP Other NLP Toolkits • UIMA • OpenCalais • Lingpipe • OpenNLP • Stanford Tools All integrated into GATE as plugins
  • 45. University of Sheffield, NLP 3. Analysing Social Media
  • 46. University of Sheffield, NLP Analysing language in social media is hard ● Grundman:politics makes #climatechange scientific issue,people don’t like knowitall rational voice tellin em wat 2do ● @adambation Try reading this article , it looks like it would be really helpful and not obvious at all. http://t.co/mo3vODoX ● Want to solve the problem of #ClimateChange? Just #vote for a #politician! Poof! Problem gone! #sarcasm #TVP #99% ● Human Caused #ClimateChange is a Monumental Scam! http://www.youtube.com/watch?v=LiX792kNQeE … F**k yes!! Lying to us like MOFO's Tax The Air We Breath! F**k Them!
  • 47. University of Sheffield, NLP Challenges for NLP ● Noisy language: unusual punctuation, capitalisation, spelling, use of slang, sarcasm etc. ● Terse nature of microposts such as tweets ● Use of hashtags, @mentions etc causes problems for tokenisation #thisistricky ● Lack of context gives rise to ambiguities ● NER performs poorly on microposts, mainly because of linguistic pre-processing failure – Performance of standard IE tools decreases from ~90% to ~40% when run on tweets rather than news articles
  • 48. University of Sheffield, NLP Persons in news articles
  • 49. University of Sheffield, NLP Persons in tweets
  • 50. University of Sheffield, NLP Lack of context causes ambiguity Branching out from Lincoln park after dark ... Hello Russian Navy, it's like the same thing but with glitter! ??
  • 51. University of Sheffield, NLP Getting the NEs right is crucial Branching out from Lincoln park after dark ... Hello Russian Navy, it's like the same thing but with glitter!
  • 52. University of Sheffield, NLP How do we deal with this kind of text? ● Typical NLP pipeline means that degraded performance has a knock-on effect along the chain ● Short sentences confuse language identification tools ● Linguistic processing tools have to be adapted to the domain ● Retraining individual components on large volumes of data ● Adaptation of techniques from e.g. SMS analysis ● Development of new Twitter-specific tools (e.g. GATE's TwitIE) ● But....lack of standards, easily accessible data, common evaluation etc. are holding back development
  • 53. University of Sheffield, NLP TwitIE to the rescue
  • 54. University of Sheffield, NLP 4. Text Analytics for Semantic Search
  • 55. University of Sheffield, NLP Text Search isn't Enough “I like the Internet. Really, I do. Any time I need a piece of shareware or I want to find out the weather in Bogota... I'm the first guy to get the modem humming. But as a source of information, it sucks. You got a billion pieces of data, struggling to be heard and seen and downloaded, and anything I want to know seems to get trampled underfoot in the crowd." Michael Marshall, The Straw Men. HarperCollins Publishers, 2002.
  • 56. University of Sheffield, NLP Semantic Queries in Google
  • 57. University of Sheffield, NLP Searching for Things, Not Strings http://googleblog.blogspot.it/2012/05/introducing-knowledge-graph-things-not.html • 500 million entities that Google “knows” about • Used to provide more accurate search results • Summaries of information about the entity being searched
  • 58. University of Sheffield, NLP Facebook Graph Search
  • 59. University of Sheffield, NLP Semantic Enrichment ● Textual mentions aren't actually that useful in isolation – knowing that something is a “Person" isn't very helpful – knowing which Person the mention refers to can be very useful ● Disambiguating mentions against an ontology provides extra context ● This is where semantic enrichment comes in ● The end product is a set of textual mentions linked to an ontology, otherwise known as semantic annotations ● Annotations on their own can be useful but they can also – be used to generate corpus level statistics – be used for further ontology population – form the basis of summaries – be indexed to provide semantic search
  • 60. University of Sheffield, NLP Automatic Semantic Enrichment • Use Text Mining, e.g. • Information Extraction – recognise names of people, organisations, locations, dates, references, etc. • Term recognition – identify domain-specific terms • Automatically extend article metadata to improve search quality • Example: using a customised GATE text mining pipeline to enrich metadata in the Envia environmental science repository http://www.envia.bl.uk/
  • 61. University of Sheffield, NLP 61 Semantic Enrichment in Envia
  • 62. University of Sheffield, NLP Mining medical records ● Medical records contain a large amount of unstructured text – letters between hospitals and GPs – discharge summaries ● These documents might contain information not recorded elsewhere – it turns out doctors don't like forms! – often information-specific fields are ignored, with everything put in the free text area
  • 63. University of Sheffield, NLP Medical Records at SLAM ● NIHR Biomedical Research Centre at the South London and Maudsley Hospital are using text mining in a number of their studies ● They have developed applications to extract: – the results of mental state tests, and the date the test was administered – education level (high school, university, etc.) – smoking status – medication history ● They have even had promising results predicting suicides
  • 64. University of Sheffield, NLP Cancer Research ● Genome Wide Association Studies (GWAS) aim to investigate genetic variants across the whole genome – With enough cases and controls, this allows them to state that a given SNP (Single Nucleotide Polymorphism) is related to a given disease. – A single study can be very expensive in both time and money to collect the required samples. ● Can we reduce the costs by analysing published articles to generate prior probabilities for each SNP?
  • 65. University of Sheffield, NLP Can Semantic Annotation Cure Cancer? ● In conjunction with IARC (International Agency for Research on Cancer, part of the WHO) we developed a text analysis approach to mine PubMed ● We showed retrospectively that our approach would have saved over a year's worth of work and more than 1.5 million Euros ● We completed a new study which found a new cause for oral cancer – Oral cancer is rare enough that traditional methods would have failed to find enough cases to make the study plausible
  • 66. University of Sheffield, NLP Government Web Archive ● We developed a semantic annotation application to process every crawled page in the archive. ● Entities annotated included; people, companies, locations, government departments, ministerial positions, social documents, dates, money.... ● Where possible, annotations were linked to an ontology which – was based on DBpedia – was extended with UK government-specific concepts – included the modelling of the evolution of government ● Annotations were indexed to allow for complex semantic querying of the collection ● Demo coming later.....
  • 67. University of Sheffield, NLP Political Futures Tracker Where in the UK did Conservative MPs tweet more about the economy?
  • 68. University of Sheffield, NLP 5.1 Semantic Annotation
  • 69. University of Sheffield, NLP Why ontologies for semantic search?  Semantic annotation: rather than just annotating the word “Cambridge” as a location, link it to an ontology instance  Differentiate between Cambridge, UK and Cambridge, Mass.  Semantic search via reasoning  So we can infer that this document mentions a city in Europe.  Ontologies tell us that this particular Cambridge is part of the country called the UK, which is part of the continent Europe.  Knowledge source  If I want to annotate strikes in baseball reports, the ontology will tell me that a strike involves a batter who is a person  In the text “BA went on strike”, using the knowledge that BA is a company and not a person, the IE system can conclude that this is not the kind of strike it is interested in
  • 70. University of Sheffield, NLP 70 More semantic search examples Q: {ScalarValue}{MeasurementUnit} -> A: “12 cm”, “190 g”, “two hours” --- Q: {Reference} -> A: JP-A-60-180889 A: Kalderon et al. (1984) Cell 39:499-509
  • 71. University of Sheffield, NLP 71 Example Semantic Search Architecture
  • 72. University of Sheffield, NLP What is Semantic Annotation? Annotation: The process of adding metadata to [parts of] a document. Semantic Annotation: Annotation process where [parts of] the annotation schema (annotation types, annotation features) are ontological objects.
  • 73. University of Sheffield, NLP Semantic Annotation
  • 74. University of Sheffield, NLP What is an Ontology?  Set of concepts (instances and classes)  Relationships between them (is-a, part-of, located-in)  Multiple inheritance  Classes can have more than one parent  Instances can have more than one class  Ontologies are graphs, not trees
  • 75. University of Sheffield, NLP URIs, Labels, Comments  The names of a classes or instance shown are URIs (Universal Resource Identifier)  http://dbpedia.org/page/Paris  The linguistic lexicalisation is typically encoded in the label property, as a string  The comment property is often used for documentation purposes, similarly a string  Comments and labels are annotation properties
  • 76. University of Sheffield, NLP Datatype Properties  Datatype properties link individuals to data values  Datatype properties can be of type boolean, date, int,  e.g. a person can have an age property  Available datatypes taken from XMLSchema http://dbpedia.org/page/Paris
  • 77. University of Sheffield, NLP Object Properties  Object properties link instances together  They describe relationships between instances, e.g. people work for organisations  Domain is the subject of the relation (the thing it applies to)  Range is the object of the relation (the possible”values”) Similar to domains, multiple alternative ranges can be specified by using a class description of the owl:unionOf, but this raises the complexity of the ontology and makes reasoning harder Person work_for Organisation Domain Property Range
  • 78. University of Sheffield, NLP DBpedia ● Machine readable knowledge on various entities and topics, including: – 410,000 places/locations, – 310,000 persons – 140,000 organisations ● For each entity we have: – entity name variants (e.g. IBM, Int. Business Machines) – a textual abstract – reference(s) to corresponding Wikipedia page(s) – entity-specific properties (e.g. latitude and longitude for places)
  • 79. University of Sheffield, NLP Example from DBpedia … Links to GeoNames And Freebase Latitude & Longitude
  • 80. University of Sheffield, NLP GeoNames ● 2.8 million populated places – 5.5 million alternate names ● Knowledge about NUTS country sub-divisions – use for enrichment of recognised locations with the implied higher-level country sub-divisions ● However, the sheer size of GeoNames creates a lot of ambiguity during semantic enrichment ● We use it as an additional knowledge source, but not as a primary source (DBpedia)
  • 81. University of Sheffield, NLP 5.2 Semantic Annotation In GATE (and other tools)
  • 82. University of Sheffield, NLP Information Extraction for the Semantic Web ● Traditional IE is based on a flat structure, e.g. recognising Person, Location, Organisation, Date, Time etc. ● For the Semantic Web, we need information in a hierarchical structure ● Idea is that we attach semantic metadata to the documents, pointing to concepts in an ontology ● Information can be exported as an ontology annotated with instances, or as text annotated with links to the ontology
  • 83. University of Sheffield, NLP John lives in London . He works there for Polar Bear Design . PERSON LOCATION ORGANISATION Traditional NE Recognition
  • 84. University of Sheffield, NLP John lives in London . He works there for Polar Bear Design . PER LOC ORG Co-reference same_as
  • 85. University of Sheffield, NLP John lives in London . He works there for Polar Bear Design . Relations PER LOC ORG live_in
  • 86. University of Sheffield, NLP John lives in London . He works there for Polar Bear Design . Relations (2) PER LOC ORG employee_of
  • 87. University of Sheffield, NLP John lives in London . He works there for Polar Bear Design . Relations (3) PER LOC ORG based_in
  • 88. University of Sheffield, NLP Richer NE Tagging ● Attachment of instances in the text to concepts in the domain ontology ● Disambiguation of instances, e.g. Cambridge, MA vs Cambridge, UK
  • 89. University of Sheffield, NLP Ontology-based IE John lives in London. He works there for Polar Bear Design.
  • 90. University of Sheffield, NLP Ontology-based IE (2) John lives in London. He works there for Polar Bear Design.
  • 91. University of Sheffield, NLP Typical Semantic Annotation pipeline Analyse document structure Linguistic Pre-processing Ontology Lookup Ontology-based IE Populate ontology (optional) NE recognition Corpus Export as RDF RDF
  • 92. University of Sheffield, NLP OpenCalais http://viewer.opencalais.com/ Paste text of http://www.membranes.com/ Not easily customised/extended Domain-specific coverage varies
  • 93. University of Sheffield, NLP Zemanta ● Demo at http://blog.zemanta.com/de mo/ ● Paste text from www.membranes.com ● The main entity of interest “Hydranautics” is missed ● Common problem with general purpose, open- domain semantic annotation tools ● Best results require bespoke customisation http://www.zemanta.com/demo/
  • 94. University of Sheffield, NLP Semantic Annotation in GATE ● Insert GATE screenshot
  • 95. University of Sheffield, NLP Automatic Semantic Annotation ● Locations (linked to DBpedia and GeoNames) – Annotate the place name itself (e.g. Norwich) with the corresponding DBpedia and GeoNames URIs – Also use knowledge of the implied reference to the levels 1, 2, and 3 sub-divisions from the Nomenclature of Territorial Units for Statistics (NUTS). – For Norwich, these are East of England (UKH – level 1), East Anglia (UKH1 – level 2), and Norfolk (UKH13 – level 3). – Similarly use knowledge to retrieve nearby places
  • 96. University of Sheffield, NLP “South Gloucestershire” Example
  • 97. University of Sheffield, NLP Semantic Annotation (2) ● Organisations (linked to DBpedia) – Names of companies, government organisations, committees, agencies, universities, and other organisations ● Dates – Absolute (e.g. 31/03/2012) and relative (yesterday) ● Measurements and Percentages – e.g. 8,596 km2 , 1 km, one fifth, 10%
  • 98. University of Sheffield, NLP The YODIE Service from GATE http://demos.gate.ac.uk/trendminer/obie/ 98 http://demos.gate.c.uk/trendminer/obie/
  • 99. University of Sheffield, NLP Recognition of term variants in Decarbonet
  • 100. University of Sheffield, NLP ClimaTerm Web service ● http://services.gate.ac.uk/decarbonet/term-recognition/ ● Input: a document ● Output: the text as an XML file annotated with term and URI <Document> <TermInstance=" http://www.eionet.europa.eu/gemet/concept/1148">cars </Term> pollute by emitting <Term Instance="http://reegle.info/glossary/1064">carbon dioxide</Term> </Document> “Cars pollute by emitting carbon dioxide.”
  • 101. University of Sheffield, NLP Term recognition demo
  • 102. University of Sheffield, NLP Semantic Search: An Overview
  • 103. University of Sheffield, NLP GATE Mímir • can be used to index and search over text, annotations, semantic metadata (concepts and instances) • allows queries that arbitrarily mix full-text, structural, linguistic and semantic annotations • is open source
  • 104. University of Sheffield, NLP What can GATE Mímir do that Google can't? Show me: • all documents mentioning a temperature between 30 and 90 degrees F (expressed in any unit) • all abstracts written in French on patent documents from the last 6 months which mention any form of the word “transistor” in the English abstract • the names of the patent inventors of those abstracts • all documents mentioning steel industries in the UK, along with their location
  • 105. University of Sheffield, NLP Search News Articles for Politicians born in Sheffield http://demos.gate.ac.uk/mimir/gpd/search/gus
  • 106. University of Sheffield, NLP Easily Create Your Own Custom GATE Mímir Interfaces http://demos.gate.ac.uk/pin/
  • 107. University of Sheffield, NLP MIMIR: Searching Text Mining Results ● Searching and managing text annotations, semantic information, and full text documents in one search engine ● Queries over annotation graphs ● Regular expressions, Kleene operators ● Designed to be integrated as a web service in custom end-user systems with bespoke interfaces ● Demos at http://services.gate.ac.uk/mimir/
  • 108. University of Sheffield, NLP Modelling Text Annotations and Additional Knowledge Sources
  • 109. University of Sheffield, NLP MIMIR: the technical bit ● Uses a combination of an FTS engine (MG4J), a database for indexing the semantic annotations (H2), and a knowledge repository (OWLIM/Sesame) ● Use database to indexing annotation kinds, then the additional linguistic data for words/annotations are stored in MG4J (POS, morphological root, etc) ● Semantic query: against the database/ knowledge base for the additional knowledge ● A query: decompose into the respective parts, query the appropriate component, then merge the results
  • 110. University of Sheffield, NLP Clustering and Scalability ● All searches are local to a document, so we can use clusters of semantic repositories: ● Faster indexing (less data in each repository) ● Faster searching (search space is broken into slices that are searched in parallel – like Google). ● Joining results is trivial: union of result sets. ● Simple scalability: just add more nodes. ● Search speed stays almost constant while the data increases (each individual repository has the same amount of data; there are just more repositories)
  • 111. University of Sheffield, NLP Ranking of Results ● By default, documents are returned in index order ● The following ranking algorithms are also available (others can be implemented via plugins if required) – Count – Hit length – TF.IDF – BM25 • These algorithms treat annotations within the queries in the same way as words, using index level frequencies etc. • The ranking algorithm can be changed without re-indexing, allowing for easy experimentation
  • 112. University of Sheffield, NLP Scaling Up ● We annotated 1.08 million web pages using a GATE language analysis pipeline. – Documents crawled using Heritrix10 , with total content size of 57 GiB or 6.6 billions plain text characters. – The indexing server has 2 Intel Xeon 2.8GHz CPUs 11 GB of RAM, and runs 64 bit Ubuntu Linux. Indexing process was 94 hours. ● We also indexed 150 million similar web pages, using two hundred Amazon EC2 Large Instances running for a week to produce a federated index ● Mimir runs on GateCloud.net, so easy to scale up
  • 113. University of Sheffield, NLP Search for string “Harriet Harman”
  • 114. University of Sheffield, NLP Search for string “Harriet Harman says”
  • 115. University of Sheffield, NLP Search with morphological variants: Harriet Harman root:say
  • 116. University of Sheffield, NLP Replace strings with NEs: {Person} root:say
  • 117. University of Sheffield, NLP {Person} AND root:say – 11803 hits
  • 118. University of Sheffield, NLP {Person} [0..5] root: say – 5495 hits
  • 119. University of Sheffield, NLP Patent Annotation: Data Model 119
  • 120. University of Sheffield, NLP An Example Text 120
  • 121. University of Sheffield, NLP Hands-On with patent data http://demos.gate.ac.uk/mimir/patents/search/index Text. Matches plain text. Example: nanomaterial Linguistic variations of text Example: (root:nanomaterial | root:nanoparticle ) Annotation. Matches semantic annotations. Syntax: {Type feature1=value1 feature2=value2...} Example: {Abstract lang="DE"} Sequence Query. Sequence of other queries. Syntax: Query1 [n..m] Query2... Example: from {Measurement} [1..5] {Measurement}
  • 122. University of Sheffield, NLP Inclusion Queries IN Query. Hits of one query only if inside another. Syntax: Query1 IN Query2 Example: (root:nanomaterial | root:nanoparticle ) IN {Abstract} Number of times these words are mentioned in patent abstracts (as well as links to the actual documents) OVER Query. Hits of a query, only if overlapping hits of another. Syntax: Query1 OVER Query2 Example: {Abstract} OVER (root:nanomatrial | root:nanoparticle ) Finds all abstracts that contain nanomaterial(s) or nanoparticle(s)
  • 123. University of Sheffield, NLP Date restrictions ( {Abstract lang="EN"} OVER (root:nanomaterial | root:nanoparticle ) ) IN {PatentDocument date > 20050000} YYYYMMDD
  • 124. University of Sheffield, NLP Find references to literature or patents in the prior art or background sections, which contain nanomaterial/nanoparticle ({Reference type="Literature"} | {Reference type="Patent"} ) IN ({Section type="PriorArt"} | {Section type="BackgroundArt"} ) OVER (root:nanomaterial | root:nanoparticle)
  • 125. University of Sheffield, NLP Queries Using External Knowledge {Measurement spec="1 to 100 volts“} ● Uses GNU Units (http://www.gnu.org/software/units/) to convert measurements and normalise them to ISI units {Measurement spec="1 to 100 kg m^2 / A s^3"} ● Example hits: 10 volts, 2V, +20 and -20 volts; ±10V; +/- 100V; +3.3 volts {Measurement spec="1 to 100 m / s"} ● Example hits: 40 km/hr, 60m/min, 100cm/sec, 60 fps; 10 to 2000 cm/sec
  • 126. University of Sheffield, NLP Searching LOD with SPARQL • SQL-like query language for RDF data • Simple protocol for querying remote databases over HTTP • Query types • select: projections of variables and expressions • construct: create triples (or graphs) based on query results • ask: whether a query returns results (result is true/false) • describe: describes resources in the graph
  • 127. University of Sheffield, NLP SPARQL Example # Software companies founded in the US PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX dbpedia: <http://dbpedia.org/resource/> PREFIX dbp-ont: <http://dbpedia.org/ontology/> PREFIX geo-ont: <http://www.geonames.org/ontology#> PREFIX umbel-sc: <http://umbel.org/umbel/sc/> SELECT DISTINCT ?Company ?Location WHERE { ?Company rdf:type dbp-ont:Company ; dbp-ont:industry dbpedia:Computer_software ; dbp-ont:foundationPlace ?Location . ?Location geo-ont:parentFeature dbpedia:United_States . }
  • 128. University of Sheffield, NLP SPARQL Results ● Try at: http://factforge.net/sparql
  • 129. University of Sheffield, NLP {Person sparql = "SELECT ?inst WHERE { ?inst :birthPlace <http://dbpedia.org/resource/Sheffield>}"} Documents mentioning Persons born in Sheffield The BBC web document does not mention Sheffield at all: http://www.bbc.co.uk/news/uk-politics-11494915 The relevant text snippet is:
  • 130. University of Sheffield, NLP Try these on the BBC News demo: http://services.gate.ac.uk/mimir/gpd/search/index ● Gordon Brown [0..3] root:say ● {Person} [0..3] root:say ● {Person inst ="http://dbpedia.org/resource/Gordon_Brown"} [0..3] root:say ● {Person sparql="SELECT ?inst WHERE { ?inst a :Politician }"} [0..3] root:say ● {Person sparql = "SELECT ?inst WHERE { ?inst :party <http://dbpedia.org/resource/Labour_Party_%28UK%29> }" } [0..3] root:say ● {Person sparql = "SELECT ?inst WHERE { ?inst :party <http://dbpedia.org/resource/Labour_Party_ %28UK%29> . ?inst :almaMater <http://dbpedia.org/resource/University_of_Edinburgh> }" } [0..3] root:say
  • 131. University of Sheffield, NLP User Interfaces for SPARQL-based Semantic Search • SPARQL-based semantic searches, tapping into LOD resources are extremely powerful • However, impossible to write by the vast majority of users • User interfaces for SPARQL-based semantic search: • Faceted searches (see ExoPatent next) • Form-based searches (see EnviLOD, PIN) • Text-based searches (natural language interfaces for querying ontologies), e.g. FREyA
  • 132. University of Sheffield, NLP Faceted Search: ExoPatent Example ● Use semantic information to expose linkages between documents based on the intersecting relationships between various sets of data from ● FDA Orange Book (23,000 patented drugs ● Unified Medical Language System (UMLS) – database of medical terms (370,000) ● Patent bibliographic information ● Search for diseases, drug names, body parts, references to literature and other patents, numeric values, ranges ● Demo uses a small set of patents (40,000)
  • 133. University of Sheffield, NLP ExoPatent: Faceted Search
  • 134. University of Sheffield, NLP Annotated Patent Document
  • 135. University of Sheffield, NLP Find all applicants who filed patents related to mitochondria, as well as drug names and active ingredients
  • 136. University of Sheffield, NLP Semantic Search over Content and Annotations http://demos.gate.ac.uk/trendminer/envilod
  • 137. University of Sheffield, NLP EnviLOD Semantic Search UI
  • 138. University of Sheffield, NLP EnviLOD Semantic Search UI
  • 139. University of Sheffield, NLP EnviLOD Semantic Search UI
  • 140. University of Sheffield, NLP Example Results
  • 141. University of Sheffield, NLP …and the underlying SPARQL Query {Sem_Location dbpediaSparql="select distinct ?inst where { {{ ?inst <http://dbpedia.org/property/north> ?loc} UNION { ?inst <http://dbpedia.org/property/east> ?loc } UNION { ?inst <http://dbpedia.org/property/west> ?loc } UNION { ?inst <http://dbpedia.org/property/south> ?loc } UNION { ?inst <http://dbpedia.org/property/northeast> ?loc } UNION { ?inst <http://dbpedia.org/property/northwest> ?loc } UNION { ?inst <http://dbpedia.org/property/southeast> ?loc } UNION { ?inst <http://dbpedia.org/property/southwest> ?loc } FILTER(REGEX(STR(?loc), "Sheffield", "i"))} }"} AND (root:"flood") Ongoing work: Use GeoSparql instead and be able to specify distances and reason with the richer information in GeoNames
  • 142. University of Sheffield, NLP Flooding in Oxford
  • 143. University of Sheffield, NLP Some Caveats…. ● EnviLOD demonstrator built in a 6 month project ● VERY limited environmental science content indexed – Some Defra, Environment Agency, Scottish Government reports ● PDFs of the articles are not connected to
  • 144. University of Sheffield, NLP We don't just have to look at politicians saying and measuring things ● If we first process the text with other NLP tools such as sentiment analysis, we can also search for positive or negative documents ● Or positive/negative comments about certain people/topics/things/companies ● In the Decarbonet project, we're looking at people's opinions about topics relating to climate change, e.g. fracking ● We could index on the sentiment annotations too ● Other people are using the combination of opinion mining and MIMIR to look at e.g. customer feedback about products / companies on a huge corpus
  • 145. University of Sheffield, NLP A positive tweet
  • 146. University of Sheffield, NLP A negative tweet
  • 147. University of Sheffield, NLP A Sarcastic Tweet
  • 148. University of Sheffield, NLP Explicitly Choosing The Search Classes
  • 149. University of Sheffield, NLP Choosing A Specific Instance
  • 150. University of Sheffield, NLP What diseases are in these documents?
  • 151. University of Sheffield, NLP What Pathogens?
  • 152. University of Sheffield, NLP Disease vs Disease Co-ocurrences
  • 153. University of Sheffield, NLP Diseases vs Pathogens
  • 154. University of Sheffield, NLP Summary ● Text mining is a very useful pre-requisite for doing all kinds of more interesting things, like semantic search ● Semantic web search allows you to do much more interesting kinds of search than a standard text-based search ● Text mining is hard, so it won't always be correct, however ● This is especially true on lower quality text such as social media ● On the other hand, social media has some of the most interesting material to look at ● Still plenty of work to be done, but there are lots of tools that you can use now and get useful results ● We run annual GATE training courses where you can spend a whole week learning all this and more!
  • 155. University of Sheffield, NLP Acknowledgements and further information ● Research partially supported by the European Union/EU under the Information and Communication Technologies (ICT) theme of the 7th Framework Programme for R&D (FP7) DecarboNet (610829) http://www.decarbonet.eu and TrendMiner (287863), Nesta ● PFT: https://gate.ac.uk/projects/pft/ ● Semantic Search over Documents and Ontologies (2014) K Bontcheva, V Tablan, H Cunningham. Bridging Between Information Retrieval and Databases, 31-53 ● V. Tablan, K. Bontcheva, I. Roberts, and H. Cunningham. Mímir: An open-source semantic search framework for interactive information seeking and discovery. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 2014 ● These and other relevant papers on the GATE website: http://gate.ac.uk/gate/doc/papers.html ● Annual GATE training course in June in Sheffield: https://gate.ac.uk/family/training.html ● Download gate: http://gate.ac.uk/download