The CW Corpus
A new resource for evaluating the
identification of complex words
Matthew Shardlow
The University of Manchester

http://lexicalsimplification.blogspot.co.uk

Lexical Simplification
● Complex Word Identification
  – He profoundly changed.
● Substitution Generation
  – Profoundly: extremely, very, deeply, acutely
● Word Sense Disambiguation
  – Profoundly: extremely, very, deeply, acutely
● Synonym Ranking
  – #1) deeply
  – #2) extremely
  – #3) acutely

Complex Words
● How do we define a Complex Word?
● Manual Definition
  – Any word which impedes a reader's comprehension of a text.
● Heuristic Features
  – Frequency
  – Familiarity
  – Length
  – Context

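As a rough illustration of the heuristic features, the sketch below computes two of them (frequency and length) and combines them into a simple rule. The toy corpus, the function names and the thresholds are all assumptions made for this example; the slides do not say how the features are measured or combined.

```python
from collections import Counter

# Hypothetical toy corpus standing in for a real word-frequency list.
TOY_CORPUS = (
    "he changed very much and he changed deeply "
    "the change was very deep and very clear"
).split()
FREQ = Counter(TOY_CORPUS)


def heuristic_features(word):
    """Return two of the heuristic features: length and corpus frequency."""
    w = word.lower()
    return {"length": len(w), "frequency": FREQ[w]}


def looks_complex(word, max_freq=0, min_length=8):
    """Flag a word when it is both rare and long (an assumed rule)."""
    f = heuristic_features(word)
    return f["frequency"] <= max_freq and f["length"] >= min_length


print(looks_complex("profoundly"), looks_complex("deeply"))
```

Familiarity and context would need external resources (for example a psycholinguistic database or the surrounding sentence), so they are left out of this sketch.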
Complex Word Identification
● Important to get it right: Propagation errors
  – Correct: He profoundly changed → He deeply changed
  – Incorrect: He profoundly changed → He profoundly turned
● No evaluation data.
● Gold standard data required.

Gold Standard Data
● Criteria for corpus entries:
  – Annotated Sentences.
  – Coherent English.
  – One complex word per sentence.
● Difficult to generate automatically.
● Expensive to generate manually.
● So, we mine Simple Wikipedia Edit Histories.

Simple Wikipedia Edit Histories
● Simple Wikipedia is:
  – An online encyclopedia.
  – Written in simplified English.
  – Collaboratively edited.
  – Available to download in XML format.
● Changes to articles recorded in edit histories.
● Some changes are simplifications.

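To make the idea of reading edit histories concrete, here is a minimal sketch that pairs up adjacent revisions of a page from an XML export. The inline sample is a hypothetical, heavily reduced stand-in for a real dump (real MediaWiki exports carry an XML namespace and many more fields), and the function name is invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature dump; not the real export schema in full.
SAMPLE_DUMP = """
<mediawiki>
  <page>
    <title>Example</title>
    <revision><id>1</id><text>He profoundly changed.</text></revision>
    <revision><id>2</id><text>He deeply changed.</text></revision>
    <revision><id>3</id><text>He deeply changed. He left.</text></revision>
  </page>
</mediawiki>
"""


def adjacent_revisions(xml_text):
    """Yield (title, older_text, newer_text) for each pair of adjacent revisions."""
    root = ET.fromstring(xml_text.strip())
    for page in root.iter("page"):
        title = page.findtext("title")
        texts = [rev.findtext("text") or "" for rev in page.findall("revision")]
        # zip(texts, texts[1:]) walks the history one edit at a time.
        for older, newer in zip(texts, texts[1:]):
            yield title, older, newer


pairs = list(adjacent_revisions(SAMPLE_DUMP))
for title, older, newer in pairs:
    print(title, "|", older, "->", newer)
```

A full dump is far too large to load in one go, so a real run would more likely stream it (for example with `ET.iterparse`) rather than parse a single string.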
Simple Wikipedia Edit Histories
● Advantages:
  – Fully automated
  – High throughput
  – Cost-effective
● Disadvantages:
  – Content quality
  – Sparsity of simplifications
  – Data exhaustion

Mining – Extract Likely Candidates
● There are 2 stages to the mining process.
● Stage 1:
  – 2 adjacent revisions are selected.
  – A similarity score (TF-IDF) is calculated at sentence level.
  – High scoring pairs passed on.
  – All other pairs discarded.

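Stage 1 could be sketched roughly as follows: sentences from two adjacent revisions are turned into TF-IDF vectors and compared with cosine similarity. The smoothed IDF formula, the 0.5 threshold, the tokenizer and the choice to skip identical sentences are all assumptions for illustration; the slides only state that a TF-IDF similarity score is used.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    return re.findall(r"[a-z']+", sentence.lower())


def tfidf_vectors(sentences):
    """Build one sparse TF-IDF vector (a dict) per sentence."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF (an assumption) so widely shared terms keep some weight.
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def candidate_pairs(old_sentences, new_sentences, threshold=0.5):
    """Pass on high-scoring sentence pairs; unchanged sentences are skipped."""
    vecs = tfidf_vectors(old_sentences + new_sentences)
    old_vecs = vecs[: len(old_sentences)]
    new_vecs = vecs[len(old_sentences):]
    kept = []
    for i, u in enumerate(old_vecs):
        for j, v in enumerate(new_vecs):
            if old_sentences[i] == new_sentences[j]:
                continue  # nothing was edited in this sentence
            if cosine(u, v) >= threshold:
                kept.append((old_sentences[i], new_sentences[j]))
    return kept


old = ["He profoundly changed.", "The cat sat on the mat."]
new = ["He deeply changed.", "The cat sat on the mat."]
print(candidate_pairs(old, new))
```

In this toy run only the edited sentence survives; the unchanged sentence and the unrelated cross-pairs are dropped.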
Mining – Validate Candidates
● There are 2 stages to the mining process.
● Stage 2: A series of checks
  – One word difference.
  – Real words. (not: spam / vandalism / nonsense)
  – Different stems.
  – Synonyms.
  – Simplifying.

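The Stage 2 checks can be sketched as a chain of filters. Everything below is a toy stand-in: the vocabulary, synonym set, frequency table and suffix-stripping stemmer are hypothetical, and using frequency to decide whether a change is "simplifying" is an assumption, since the slides do not say how that check is made.

```python
# Hypothetical toy resources, not those used to build the CW Corpus.
VOCABULARY = {"cool", "cooler", "calm", "long", "profoundly", "deeply"}
SYNONYMS = {frozenset({"profoundly", "deeply"}), frozenset({"calm", "cool"})}
FREQUENCY = {"profoundly": 5, "deeply": 40, "calm": 30, "cool": 30}


def toy_stem(word):
    """Very crude suffix stripper, for illustration only."""
    for suffix in ("er", "ly", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def single_word_difference(old, new):
    """Return (old_word, new_word) if exactly one token differs, else None."""
    a, b = old.lower().split(), new.lower().split()
    if len(a) != len(b):
        return None
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return diffs[0] if len(diffs) == 1 else None


def validate(old, new):
    """Apply the five checks in order; return (complex, simple) or None."""
    pair = single_word_difference(old, new)          # one word difference
    if pair is None:
        return None
    x, y = (w.strip(".,;:!?") for w in pair)
    if x not in VOCABULARY or y not in VOCABULARY:   # real words
        return None
    if toy_stem(x) == toy_stem(y):                   # different stems
        return None
    if frozenset({x, y}) not in SYNONYMS:            # synonyms
        return None
    if FREQUENCY.get(y, 0) <= FREQUENCY.get(x, 0):   # simplifying (assumed: more frequent)
        return None
    return x, y


print(validate("He profoundly changed.", "He deeply changed."))
print(validate("It was a cooler evening.", "It was a cool evening."))
```

With these toy resources, the four "It was a _____ evening." pairs from the backup slide on discarded pairs (Cuol, Cooler, Long, Calm → Cool) are each rejected by a different check.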
Analysis
● Six Annotators
● Each given a 70 instance sample.
  – 50 examples from the corpus (different for each).
  – 20 common examples as a validation set.
● 2 annotators ruled out by validation set.
● Final corpus accuracy of: 97.5%.

Experiments
● Several experiments performed so far.
● Presented at ACL Student Research Workshop.
● 3 techniques for identification were compared.
● Sophisticated strategies gave little or no improvement over a baseline.

Summary
● Identifying Complex Words is important.
● The CW Corpus lets us evaluate methods.
● Preliminary results show little or no improvement over a baseline.

Any Questions?
● Corpus: http://tinyurl.com/cwcorpus

Annotator Agreement

Annotator Index    Kappa    Sample Accuracy
1                  1        98%
2                  1        96%
3                  0.4      70%
4                  1        100%
5                  0.6      84%
6                  1        96%

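The 97.5% corpus accuracy quoted earlier appears to follow from this table: dropping the two annotators with low kappa (3 and 5) and averaging the remaining four sample accuracies gives that figure. The sketch below reproduces the arithmetic. The 0.8 cut-off is a hypothetical value chosen only to separate the two groups, and the plain mean assumes each annotator's sample carries equal weight.

```python
# (annotator index, kappa on the shared validation set, sample accuracy in %)
ANNOTATORS = [
    (1, 1.0, 98), (2, 1.0, 96), (3, 0.4, 70),
    (4, 1.0, 100), (5, 0.6, 84), (6, 1.0, 96),
]

KAPPA_THRESHOLD = 0.8  # hypothetical cut-off; the slides do not state one

# Keep only annotators whose agreement on the validation set is high enough.
retained = [(idx, acc) for idx, kappa, acc in ANNOTATORS if kappa >= KAPPA_THRESHOLD]
retained_ids = [idx for idx, _ in retained]

# Equal-sized samples, so an unweighted mean is used.
corpus_accuracy = sum(acc for _, acc in retained) / len(retained)

print(retained_ids, corpus_accuracy)  # [1, 2, 4, 6] 97.5
```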
Example Discarded Pairs
● It was a _____ evening.
● Nonsense Words (spelling correction)
  – Cuol → Cool
● Different Stems (sense correction)
  – Cooler → Cool
● Synonymy (meaning change)
  – Long → Cool
● Simplifying
  – Calm → Cool

The CW Corpus PITR2013