SlideShare une entreprise Scribd logo
1  sur  34
Télécharger pour lire hors ligne
Time Expression Analysis and Recognition
Using Syntactic Token Types and General Heuristic Rules
Xiaoshi Zhong, Aixin Sun, and Erik Cambria
Computer Science and Engineering
Nanyang Technological University
{xszhong, axsun, cambria}@ntu.edu.sg
Outline
•Time expression analysis
• Datasets: TimeBank, Gigaword, WikiWars, Tweets
• Findings: short expressions, occurrence, small vocabulary, similar syntactic
behavior
•Time expression recognition
• SynTime: syntactic token types and general heuristic rules
• Baselines: HeidelTime, SUTime, UWTime
Time Expression Analysis
•Datasets
• TimeBank
• Gigaword
• WikiWars
• Tweets
•Findings
• Short time expressions
• Occurrence
• Small vocabulary
• Similar syntactic behaviour
now
today
Friday
February
the last week
13 January 1951
June 30, 1990
8 to 20 days
the third quarter of 1984
…
Example time expressions:
Time Expression Analysis - Datasets
•Datasets
• TimeBank: a benchmark dataset used in TempEval series
• Gigaword: a large dataset with generated labels and used in TempEval-3
• WikiWars: a specific domain dataset collected from Wikipedia about war
• Tweets: a manually labeled dataset with informal text collected from Twitter
•Statistics of the datasets
Dataset #Docs #Words #TIMEX
TimeBank 183 61,418 1,243
Gigaword 2,452 666,309 12,739
WikiWars 22 119,468 2,671
Tweets 942 18,199 1,127
The four datasets vary in source, size, domain,
and text type, but we will see that their time
expressions demonstrate similar characteristics.
Time Expression Analysis – Finding 1
•Short time expressions: time expressions are very short.
Time expressions follow a similar length distribution
Dataset Average length
TimeBank 2.00
Gigaword 1.70
WikiWars 2.38
Tweets 1.51
Average length of time expressions
80% of time expressions contain ≤3 words
90% of time expressions contain ≤4 words
Average length: about 2 words
Time Expression Analysis – Finding 2
•Occurrence: most of time expressions contain time token(s).
Percentage of time expressions that
contain time token(s)
Example time tokens (red):
Dataset Percentage
TimeBank 94.61
Gigaword 96.44
WikiWars 91.81
Tweets 96.01
now
today
Friday
February
the last week
13 January 1951
June 30, 1990
8 to 20 days
the third quarter of 1984
…
Time Expression Analysis – Finding 3
•Small vocabulary: only a small group of time words are used to
express time information.
Number of distinct words and time tokens in time expressions
Dataset #Words #Time tokens
TimeBank 130 64
Gigaword 214 80
WikiWars 224 74
Tweets 107 64
45 distinct time tokens appear in all the four datasets.
That means, time expressions highly overlap at their time tokens.
#Words #Time tokens
350 123
Number of distinct words and time tokens across four datasets
next year
2 years
year 1
10 yrs ago
Overlap at year
Time Expression Analysis – Finding 4
•Similar syntactic behaviour: (1) POS information cannot
distinguish time expressions from common text, but (2) within time
expressions, POS tags can help distinguish their constituents.
• (1) For the top 40 POS tags (10 × 4 datasets), 37 have percentage lower than
20%, other 3 are CD.
• (2) Time tokens mainly have NN* and RB, modifiers have JJ and RB, and
numerals have CD.
Time Expression Analysis – Eureka!
•Similar syntactic behaviour: (1) POS information cannot
distinguish time expressions from common text, but (2) within time
expressions, POS tags can help distinguish their constituents.
• (1) For the top 40 POS tags (10 × 4 datasets), 37 have percentage lower than
20%, other 3 are CD.
• (2) Time tokens mainly have NN* and RB, modifiers have JJ and RB, and
numerals have CD.
When seeing (2), we realize that this is exactly how linguists define part-of-speech for
language; similar words have similar syntactic behaviour. The definition of part-of-speech
for language inspires us to define a type system for the time expression, part of language.
Our Eureka! moment
Time Expression Analysis - Summary
• Summary
• On average, a time expression contains two tokens; one is time token and the
other is modifier/numeral. And the time tokens are in small size.
• Idea for recognition
• To recognize a time expression, we first recognize the time token, then
recognize the modifier/numeral.
Time Expression Analysis - Idea
• Summary
• On average, a time expression contains two tokens; one is time token and the
other is modifier/numeral. And the time tokens are in small size.
• Idea for recognition
• To recognize a time expression, we first recognize the time token, then
recognize the modifier/numeral.
20 days; this week; next year; July 29; …
Time Expression Analysis - Idea
• Summary
• On average, a time expression contains two tokens; one is time token and the
other is modifier/numeral. And the time tokens are in small size.
• Idea for recognition
• To recognize a time expression, we first recognize the time token, then
recognize the modifier/numeral.
20 days; this week; next year; July 29; …
Time token
Time Expression Analysis - Idea
• Summary
• On average, a time expression contains two tokens; one is time token and the
other is modifier/numeral. And the time tokens are in small size.
• Idea for recognition
• To recognize a time expression, we first recognize the time token, then
recognize the modifier/numeral.
20 days; this week; next year; July 29; …
Time token Modifier/Numeral
Time Expression Recognition
• SynTime
• Syntactic token types
• General heuristic rules
• Baseline methods
• HeidelTime
• SUTime
• UWTime
• Experiment datasets
• TimeBank
• WikiWars
• Tweets
Time Expression Recognition - SynTime
• Syntactic token types
• General heuristic rules
Time Expression Recognition - SynTime
• Syntactic token types – A type system
• Time token: explicitly express time information, e.g., “year”
• 15 token types: DECADE, YEAR, SEASON, MONTH, WEEK, DATE, TIME, DAY_TIME, TIMELINE,
HOLIDAY, PERIOD, DURATION, TIME_UNIT, TIME_ZONE, ERA
• Modifier: modify time tokens, e.g., “next” modifies “year” in “next year”
• 5 token types: PREFIX, SUFFIX, LINKAGE, COMMA, IN_ARTICLE
• Numeral: ordinals and numbers, e.g., “10” in “next 10 years”
• 1 token type: NUMERAL
• Token types to tokens is like POS tags to words
• POS tags: next/JJ 10/CD years/NNS
• Token types: next/PREFIX 10/NUMERAL years/TIME_UNIT
Time Expression Recognition - SynTime
• General heuristic rules
• Only relevant to token types
• Independent of specific tokens
SynTime – Layout
General Heuristic Rules
1989, February, 12:55, this year, 3 months ago, ...
Time Token, Modifier, Numeral
Rule level
Type level
Token level
Token level: time-related tokens and token regular expressions
Type level: token types group the tokens and token regular expressions
Rule level: heuristic rules work on token types and are independent of specific tokens
SynTime – Overview in practice
Identify time tokens
Identify modifiers and
numerals by expanding the
time tokens’ boundaries
Extract time expressions
Import token regex to time
token, modifier, numeral
Add keywords under
defined token types and
do not change any rules
An example: the third quarter of 1984
A sequence of tokens: the third quarter of 1984
An example: the third quarter of 1984
A sequence of tokens:
Assign tokens with token types
the third quarter of 1984
PREFIX NUMERAL TIME_UNIT PREFIX YEAR
An example: the third quarter of 1984
A sequence of tokens:
Assign tokens with token types
Identify time tokens
the third quarter of 1984
PREFIX NUMERAL TIME_UNIT PREFIX YEAR
Heuristic Rules
An example: the third quarter of 1984
Identify modifiers and numerals by
searching time tokens’ surroundings
A sequence of tokens:
Assign tokens with token types
Identify time tokens
the third quarter of 1984
PREFIX NUMERAL TIME_UNIT PREFIX YEAR
Heuristic Rules
An example: the third quarter of 1984
Identify modifiers and numerals by
searching time tokens’ surroundings
A sequence of tokens:
Assign tokens with token types
Identify time tokens
the third quarter of 1984
PREFIX NUMERAL TIME_UNIT PREFIX YEAR
Heuristic Rules
An example: the third quarter of 1984
Identify modifiers and numerals by
searching time tokens’ surroundings
A sequence of tokens:
Assign tokens with token types
Identify time tokens
the third quarter of 1984
PREFIX NUMERAL TIME_UNIT PREFIX YEAR
Heuristic Rules
An example: the third quarter of 1984
Identify modifiers and numerals by
searching time tokens’ surroundings
A sequence of tokens:
Assign tokens with token types
Identify time tokens
the third quarter of 1984
PREFIX NUMERAL TIME_UNIT PREFIX YEAR
Heuristic Rules
An example: the third quarter of 1984
Identify modifiers and numerals by
searching time tokens’ surroundings
A sequence of tokens:
Assign tokens with token types
Identify time tokens
the third quarter of 1984
PREFIX NUMERAL TIME_UNIT PREFIX YEAR
Heuristic Rules
An example: the third quarter of 1984
A sequence of token types PREFIX NUMERAL TIME_UNIT PREFIX YEAR
An example: the third quarter of 1984
Export a sequence of tokens
as time expression
A sequence of token types
the third quarter of 1984
PREFIX NUMERAL TIME_UNIT PREFIX YEAR
An example: the third quarter of 1984
Time expression: the third quarter of 1984
Time Expression Recognition - Experiments
• SynTime
• SynTime-I: Initial version
• SynTime-E: Expanded version, adding keywords to SynTime-I
(Add keywords under the defined token types and do not change any rules.)
• Baseline methods
• HeidelTime: rule-based method
• SUTime: rule-based method
• UWTime: learning-based method
• Experiment datasets
• TimeBank: comprehensive data in formal text
• WikiWars: specific domain data in formal text
• Tweets: comprehensive data in informal text
Dataset Methods
Strict Match Relexed Match
Pr. Re. F1 Pr. Re. F1
TimeBank
HeidelTime(Strotgen et al., 2013) 83.85 78.99 81.34 93.08 87.68 90.30
SUTime(Chang and Manning, 2013) 78.72 80.43 79.57 89.36 91.30 90.32
UWTime(Lee et al., 2014) 86.10 80.40 83.10 94.60 88.40 91.40
SynTime-I 91.43 92.75 92.09 94.29 95.65 94.96
SynTime-E 91.49 93.48 92.47 93.62 95.65 94.62
WikiWars
HeidelTime(Lee et al., 2014) 85.20 79.30 82.10 92.60 86.20 89.30
SUTime 78.61 76.69 76.64 95.74 89.57 92.55
UWTime(Lee et al., 2014) 87.70 78.80 83.00 97.60 87.60 92.30
SynTime-I 80.00 80.22 80.11 92.16 92.41 92.29
SynTime-E 79.18 83.47 81.27 90.49 95.39 92.88
Tweets
HeidelTime 89.58 72.88 80.37 95.83 77.97 85.98
SUTime 76.03 77.97 76.99 88.43 90.68 89.54
UWTime 88.54 72.03 79.44 96.88 78.81 86.92
SynTime-I 89.52 94.07 91.74 93.55 98.31 95.87
SynTime-E 89.20 94.49 91.77 93.20 98.78 95.88
Overall performance. The best results are in boldface and the second best are underlined. Some results are
borrowed from their original papers and the papers indicated by the references.
Difference from other Rule-based Methods
General Heuristic Rules
1989, February, 12:55, this year, 3 months ago, ...
Time Token, Modifier, Numeral
Rule level
Type level
Token level
SynTime
Heuristic rules work on token types and are independent
of specific tokens, thus they are independent of specific
domains and specific text types and specific languages.
the third quarter of 1984
PREFIX NUMERAL TIME_UNIT PREFIX YEAR
Heuristic Rules
Method
Layout
Property
Example
Deterministic Rules
1989, February, 12:55, this year, 3 months ago, ...
Rule level
Token level
Other rule-based methods
Deterministic rules directly work on tokens and phrases
in a fixed manner, thus the taggers lack flexibility
/the/? [{tag:JJ}]? ($NUM_ORD) /-/? [{tag:JJ}]? /quarter/
A simple idea
Rules can be designed with
generality and heuristics
the third quarter of 1984
PREFIX NUMERAL TIME_UNIT PREFIX YEAR
Heuristic Rules

Contenu connexe

Similaire à Xiaoshi Zhong - 2017 - Time Expression Analysis and Recognition Using Syntactic Token Types and General Heuristic Rules

Term paper presentation 2014 (1)
Term paper presentation 2014 (1)Term paper presentation 2014 (1)
Term paper presentation 2014 (1)
Gaya Zimrutyan
 
Thesis Writing format.pptx
Thesis Writing format.pptxThesis Writing format.pptx
Thesis Writing format.pptx
Ajit Shinde
 

Similaire à Xiaoshi Zhong - 2017 - Time Expression Analysis and Recognition Using Syntactic Token Types and General Heuristic Rules (17)

Riyadh UseR Group - 1st Meeting (Dec 2016(
Riyadh UseR Group - 1st Meeting (Dec 2016(Riyadh UseR Group - 1st Meeting (Dec 2016(
Riyadh UseR Group - 1st Meeting (Dec 2016(
 
Session-3-Module-2.pptx
Session-3-Module-2.pptxSession-3-Module-2.pptx
Session-3-Module-2.pptx
 
Concepts and Challenges of Text Retrieval for Search Engine
Concepts and Challenges of Text Retrieval for Search EngineConcepts and Challenges of Text Retrieval for Search Engine
Concepts and Challenges of Text Retrieval for Search Engine
 
mt_cat_presentations CAT TRANSLATION PPT
mt_cat_presentations CAT TRANSLATION PPTmt_cat_presentations CAT TRANSLATION PPT
mt_cat_presentations CAT TRANSLATION PPT
 
MODELS 2019: Querying and annotating model histories with time-aware patterns
MODELS 2019: Querying and annotating model histories with time-aware patternsMODELS 2019: Querying and annotating model histories with time-aware patterns
MODELS 2019: Querying and annotating model histories with time-aware patterns
 
6_Big Data Sources part3-Day 3_A_text_mining.pptx
6_Big Data Sources part3-Day 3_A_text_mining.pptx6_Big Data Sources part3-Day 3_A_text_mining.pptx
6_Big Data Sources part3-Day 3_A_text_mining.pptx
 
Temporal Web Dynamics: Implications from Search Perspective
Temporal Web Dynamics: Implications from Search PerspectiveTemporal Web Dynamics: Implications from Search Perspective
Temporal Web Dynamics: Implications from Search Perspective
 
A combination of reduction and expansion approaches to handle with long natur...
A combination of reduction and expansion approaches to handle with long natur...A combination of reduction and expansion approaches to handle with long natur...
A combination of reduction and expansion approaches to handle with long natur...
 
Mining, Representation and Reasoning with Temporal Expressions in the Legal D...
Mining, Representation and Reasoning with Temporal Expressions in the Legal D...Mining, Representation and Reasoning with Temporal Expressions in the Legal D...
Mining, Representation and Reasoning with Temporal Expressions in the Legal D...
 
Term paper presentation 2014 (1)
Term paper presentation 2014 (1)Term paper presentation 2014 (1)
Term paper presentation 2014 (1)
 
Thesis Writing format.pptx
Thesis Writing format.pptxThesis Writing format.pptx
Thesis Writing format.pptx
 
Digging Deeper into the Common Core
Digging Deeper into the Common CoreDigging Deeper into the Common Core
Digging Deeper into the Common Core
 
Temporal models for mining, ranking and recommendation in the Web
Temporal models for mining, ranking and recommendation in the WebTemporal models for mining, ranking and recommendation in the Web
Temporal models for mining, ranking and recommendation in the Web
 
introduction into IR
introduction into IRintroduction into IR
introduction into IR
 
Analysis of Various Periodicity Detection Algorithms in Time Series Data with...
Analysis of Various Periodicity Detection Algorithms in Time Series Data with...Analysis of Various Periodicity Detection Algorithms in Time Series Data with...
Analysis of Various Periodicity Detection Algorithms in Time Series Data with...
 
SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence
SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence
SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence
 
Diachronic Analysis of Language exploiting Google Ngram
Diachronic Analysis of Language exploiting Google NgramDiachronic Analysis of Language exploiting Google Ngram
Diachronic Analysis of Language exploiting Google Ngram
 

Plus de Association for Computational Linguistics

Plus de Association for Computational Linguistics (20)

Muis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal Text
Muis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal TextMuis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal Text
Muis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal Text
 
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
 
Castro - 2018 - A Crowd-Annotated Spanish Corpus for Humour Analysis
Castro - 2018 - A Crowd-Annotated Spanish Corpus for Humour AnalysisCastro - 2018 - A Crowd-Annotated Spanish Corpus for Humour Analysis
Castro - 2018 - A Crowd-Annotated Spanish Corpus for Humour Analysis
 
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
 
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsDaniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
 
Elior Sulem - 2018 - Semantic Structural Evaluation for Text Simplification
Elior Sulem - 2018 - Semantic Structural Evaluation for Text SimplificationElior Sulem - 2018 - Semantic Structural Evaluation for Text Simplification
Elior Sulem - 2018 - Semantic Structural Evaluation for Text Simplification
 
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsDaniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
 
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
 
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
 
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
 
Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
 
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
 
Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015
 
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
 

Dernier

1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
AnaAcapella
 
Vishram Singh - Textbook of Anatomy Upper Limb and Thorax.. Volume 1 (1).pdf
Vishram Singh - Textbook of Anatomy  Upper Limb and Thorax.. Volume 1 (1).pdfVishram Singh - Textbook of Anatomy  Upper Limb and Thorax.. Volume 1 (1).pdf
Vishram Singh - Textbook of Anatomy Upper Limb and Thorax.. Volume 1 (1).pdf
ssuserdda66b
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 

Dernier (20)

Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
Vishram Singh - Textbook of Anatomy Upper Limb and Thorax.. Volume 1 (1).pdf
Vishram Singh - Textbook of Anatomy  Upper Limb and Thorax.. Volume 1 (1).pdfVishram Singh - Textbook of Anatomy  Upper Limb and Thorax.. Volume 1 (1).pdf
Vishram Singh - Textbook of Anatomy Upper Limb and Thorax.. Volume 1 (1).pdf
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Dyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptxDyslexia AI Workshop for Slideshare.pptx
Dyslexia AI Workshop for Slideshare.pptx
 

Xiaoshi Zhong - 2017 - Time Expression Analysis and Recognition Using Syntactic Token Types and General Heuristic Rules

  • 1. Time Expression Analysis and Recognition Using Syntactic Token Types and General Heuristic Rules Xiaoshi Zhong, Aixin Sun, and Erik Cambria Computer Science and Engineering Nanyang Technological University {xszhong, axsun, cambria}@ntu.edu.sg
  • 2. Outline •Time expression analysis • Datasets: TimeBank, Gigaword, WikiWars, Tweets • Findings: short expressions, occurrence, small vocabulary, similar syntactic behavior •Time expression recognition • SynTime: syntactic token types and general heuristic rules • Baselines: HeidelTime, SUTime, UWTime
  • 3. Time Expression Analysis •Datasets • TimeBank • Gigaword • WikiWars • Tweets •Findings • Short time expressions • Occurrence • Small vocabulary • Similar syntactic behaviour now today Friday February the last week 13 January 1951 June 30, 1990 8 to 20 days the third quarter of 1984 … Example time expressions:
  • 4. Time Expression Analysis - Datasets •Datasets • TimeBank: a benchmark dataset used in TempEval series • Gigaword: a large dataset with generated labels and used in TempEval-3 • WikiWars: a specific domain dataset collected from Wikipedia about war • Tweets: a manually labeled dataset with informal text collected from Twitter •Statistics of the datasets Dataset #Docs #Words #TIMEX TimeBank 183 61,418 1,243 Gigaword 2,452 666,309 12,739 WikiWars 22 119,468 2,671 Tweets 942 18,199 1,127 The four datasets vary in source, size, domain, and text type, but we will see that their time expressions demonstrate similar characteristics.
  • 5. Time Expression Analysis – Finding 1 •Short time expressions: time expressions are very short. Time expressions follow a similar length distribution Dataset Average length TimeBank 2.00 Gigaword 1.70 WikiWars 2.38 Tweets 1.51 Average length of time expressions 80% of time expressions contain ≤3 words 90% of time expressions contain ≤4 words Average length: about 2 words
  • 6. Time Expression Analysis – Finding 2 •Occurrence: most of time expressions contain time token(s). Percentage of time expressions that contain time token(s) Example time tokens (red): Dataset Percentage TimeBank 94.61 Gigaword 96.44 WikiWars 91.81 Tweets 96.01 now today Friday February the last week 13 January 1951 June 30, 1990 8 to 20 days the third quarter of 1984 …
  • 7. Time Expression Analysis – Finding 3 •Small vocabulary: only a small group of time words are used to express time information. Number of distinct words and time tokens in time expressions Dataset #Words #Time tokens TimeBank 130 64 Gigaword 214 80 WikiWars 224 74 Tweets 107 64 45 distinct time tokens appear in all the four datasets. That means, time expressions highly overlap at their time tokens. #Words #Time tokens 350 123 Number of distinct words and time tokens across four datasets next year 2 years year 1 10 yrs ago Overlap at year
  • 8. Time Expression Analysis – Finding 4 •Similar syntactic behaviour: (1) POS information cannot distinguish time expressions from common text, but (2) within time expressions, POS tags can help distinguish their constituents. • (1) For the top 40 POS tags (10 × 4 datasets), 37 have percentage lower than 20%, other 3 are CD. • (2) Time tokens mainly have NN* and RB, modifiers have JJ and RB, and numerals have CD.
  • 9. Time Expression Analysis – Eureka! •Similar syntactic behaviour: (1) POS information cannot distinguish time expressions from common text, but (2) within time expressions, POS tags can help distinguish their constituents. • (1) For the top 40 POS tags (10 × 4 datasets), 37 have percentage lower than 20%, other 3 are CD. • (2) Time tokens mainly have NN* and RB, modifiers have JJ and RB, and numerals have CD. When seeing (2), we realize that this is exactly how linguists define part-of-speech for language; similar words have similar syntactic behaviour. The definition of part-of-speech for language inspires us to define a type system for the time expression, part of language. Our Eureka! moment
  • 10. Time Expression Analysis - Summary • Summary • On average, a time expression contains two tokens; one is time token and the other is modifier/numeral. And the time tokens are in small size. • Idea for recognition • To recognize a time expression, we first recognize the time token, then recognize the modifier/numeral.
  • 11. Time Expression Analysis - Idea • Summary • On average, a time expression contains two tokens; one is time token and the other is modifier/numeral. And the time tokens are in small size. • Idea for recognition • To recognize a time expression, we first recognize the time token, then recognize the modifier/numeral. 20 days; this week; next year; July 29; …
  • 12. Time Expression Analysis - Idea • Summary • On average, a time expression contains two tokens; one is time token and the other is modifier/numeral. And the time tokens are in small size. • Idea for recognition • To recognize a time expression, we first recognize the time token, then recognize the modifier/numeral. 20 days; this week; next year; July 29; … Time token
  • 13. Time Expression Analysis - Idea • Summary • On average, a time expression contains two tokens; one is time token and the other is modifier/numeral. And the time tokens are in small size. • Idea for recognition • To recognize a time expression, we first recognize the time token, then recognize the modifier/numeral. 20 days; this week; next year; July 29; … Time token Modifier/Numeral
  • 14. Time Expression Recognition • SynTime • Syntactic token types • General heuristic rules • Baseline methods • HeidelTime • SUTime • UWTime • Experiment datasets • TimeBank • WikiWars • Tweets
  • 15. Time Expression Recognition - SynTime • Syntactic token types • General heuristic rules
  • 16. Time Expression Recognition - SynTime • Syntactic token types – A type system • Time token: explicitly express time information, e.g., “year” • 15 token types: DECADE, YEAR, SEASON, MONTH, WEEK, DATE, TIME, DAY_TIME, TIMELINE, HOLIDAY, PERIOD, DURATION, TIME_UNIT, TIME_ZONE, ERA • Modifier: modify time tokens, e.g., “next” modifies “year” in “next year” • 5 token types: PREFIX, SUFFIX, LINKAGE, COMMA, IN_ARTICLE • Numeral: ordinals and numbers, e.g., “10” in “next 10 years” • 1 token type: NUMERAL • Token types to tokens is like POS tags to words • POS tags: next/JJ 10/CD years/NNS • Token types: next/PREFIX 10/NUMERAL years/TIME_UNIT
  • 17. Time Expression Recognition - SynTime • General heuristic rules • Only relevant to token types • Independent of specific tokens
  • 18. SynTime – Layout General Heuristic Rules 1989, February, 12:55, this year, 3 months ago, ... Time Token, Modifier, Numeral Rule level Type level Token level Token level: time-related tokens and token regular expressions Type level: token types group the tokens and token regular expressions Rule level: heuristic rules work on token types and are independent of specific tokens
  • 19. SynTime – Overview in practice Identify time tokens Identify modifiers and numerals by expanding the time tokens’ boundaries Extract time expressions Import token regex to time token, modifier, numeral Add keywords under defined token types and do not change any rules
  • 20. An example: the third quarter of 1984 A sequence of tokens: the third quarter of 1984
  • 21. An example: the third quarter of 1984 A sequence of tokens: Assign tokens with token types the third quarter of 1984 PREFIX NUMERAL TIME_UNIT PREFIX YEAR
  • 22. An example: the third quarter of 1984 A sequence of tokens: Assign tokens with token types Identify time tokens the third quarter of 1984 PREFIX NUMERAL TIME_UNIT PREFIX YEAR Heuristic Rules
  • 23. An example: the third quarter of 1984 Identify modifiers and numerals by searching time tokens’ surroundings A sequence of tokens: Assign tokens with token types Identify time tokens the third quarter of 1984 PREFIX NUMERAL TIME_UNIT PREFIX YEAR Heuristic Rules
  • 24. An example: the third quarter of 1984 Identify modifiers and numerals by searching time tokens’ surroundings A sequence of tokens: Assign tokens with token types Identify time tokens the third quarter of 1984 PREFIX NUMERAL TIME_UNIT PREFIX YEAR Heuristic Rules
  • 25. An example: the third quarter of 1984 Identify modifiers and numerals by searching time tokens’ surroundings A sequence of tokens: Assign tokens with token types Identify time tokens the third quarter of 1984 PREFIX NUMERAL TIME_UNIT PREFIX YEAR Heuristic Rules
  • 26. An example: the third quarter of 1984 Identify modifiers and numerals by searching time tokens’ surroundings A sequence of tokens: Assign tokens with token types Identify time tokens the third quarter of 1984 PREFIX NUMERAL TIME_UNIT PREFIX YEAR Heuristic Rules
  • 27. An example: the third quarter of 1984 Identify modifiers and numerals by searching time tokens’ surroundings A sequence of tokens: Assign tokens with token types Identify time tokens the third quarter of 1984 PREFIX NUMERAL TIME_UNIT PREFIX YEAR Heuristic Rules
  • 28. An example: the third quarter of 1984 A sequence of token types PREFIX NUMERAL TIME_UNIT PREFIX YEAR
  • 29. An example: the third quarter of 1984 Export a sequence of tokens as time expression A sequence of token types the third quarter of 1984 PREFIX NUMERAL TIME_UNIT PREFIX YEAR
  • 30. An example: the third quarter of 1984 Time expression: the third quarter of 1984
  • 31. Time Expression Recognition - Experiments • SynTime • SynTime-I: Initial version • SynTime-E: Expanded version, adding keywords to SynTime-I (Add keywords under the defined token types and do not change any rules.) • Baseline methods • HeidelTime: rule-based method • SUTime: rule-based method • UWTime: learning-based method • Experiment datasets • TimeBank: comprehensive data in formal text • WikiWars: specific domain data in formal text • Tweets: comprehensive data in informal text
  • 32. Dataset Methods Strict Match Relexed Match Pr. Re. F1 Pr. Re. F1 TimeBank HeidelTime(Strotgen et al., 2013) 83.85 78.99 81.34 93.08 87.68 90.30 SUTime(Chang and Manning, 2013) 78.72 80.43 79.57 89.36 91.30 90.32 UWTime(Lee et al., 2014) 86.10 80.40 83.10 94.60 88.40 91.40 SynTime-I 91.43 92.75 92.09 94.29 95.65 94.96 SynTime-E 91.49 93.48 92.47 93.62 95.65 94.62 WikiWars HeidelTime(Lee et al., 2014) 85.20 79.30 82.10 92.60 86.20 89.30 SUTime 78.61 76.69 76.64 95.74 89.57 92.55 UWTime(Lee et al., 2014) 87.70 78.80 83.00 97.60 87.60 92.30 SynTime-I 80.00 80.22 80.11 92.16 92.41 92.29 SynTime-E 79.18 83.47 81.27 90.49 95.39 92.88 Tweets HeidelTime 89.58 72.88 80.37 95.83 77.97 85.98 SUTime 76.03 77.97 76.99 88.43 90.68 89.54 UWTime 88.54 72.03 79.44 96.88 78.81 86.92 SynTime-I 89.52 94.07 91.74 93.55 98.31 95.87 SynTime-E 89.20 94.49 91.77 93.20 98.78 95.88 Overall performance. The best results are in boldface and the second best are underlined. Some results are borrowed from their original papers and the papers indicated by the references.
  • 33. Difference from other Rule-based Methods General Heuristic Rules 1989, February, 12:55, this year, 3 months ago, ... Time Token, Modifier, Numeral Rule level Type level Token level SynTime Heuristic rules work on token types and are independent of specific tokens, thus they are independent of specific domains and specific text types and specific languages. the third quarter of 1984 PREFIX NUMERAL TIME_UNIT PREFIX YEAR Heuristic Rules Method Layout Property Example Deterministic Rules 1989, February, 12:55, this year, 3 months ago, ... Rule level Token level Other rule-based methods Deterministic rules directly work on tokens and phrases in a fixed manner, thus the taggers lack flexibility /the/? [{tag:JJ}]? ($NUM_ORD) /-/? [{tag:JJ}]? /quarter/
  • 34. A simple idea Rules can be designed with generality and heuristics the third quarter of 1984 PREFIX NUMERAL TIME_UNIT PREFIX YEAR Heuristic Rules