Knowledge extraction from the Encyclopedia of Life using Python NLTK

Anne Thessen
annethessen@gmail.com

Challenges
Eastern Lowland Gorilla
Gorilla berengei
Gorilla beringei mikenensis
Gorilla gorilla
Gorilla beringei Matschie
King kong
ゴリラ
Gorille
大猩猩
Virunga
Горилла
Gorilla graueri
Koko
Mountain gorilla
Guerilla
Gorila

Contextual data
Disambiguate by authority, species, contextual data

Aotus nancymaae: Primate, Monkey, Eyes, Food, Panama
Aotus mollis: Legume, Plant, Flower, Mirbelieae, Australia
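The idea of disambiguating a shared genus name by its surrounding words can be sketched as a simple overlap count. This is only an illustration, not the method used in the talk: the `CONTEXT_PROFILES` table and the `disambiguate` function are hypothetical, and the context terms are the ones listed on the slide.

```python
# Hypothetical context profiles built from the terms shown on the slide.
CONTEXT_PROFILES = {
    "Aotus nancymaae": {"primate", "monkey", "eyes", "food", "panama"},
    "Aotus mollis": {"legume", "plant", "flower", "mirbelieae", "australia"},
}


def disambiguate(genus, text, profiles=CONTEXT_PROFILES):
    """Return the candidate name whose context terms overlap the text most.

    Returns None when no candidate for the genus shares a term with the text.
    """
    # Lowercase the words and strip surrounding punctuation before comparing.
    words = {w.strip(".,;:()").lower() for w in text.split()}
    best_name, best_score = None, 0
    for name, terms in profiles.items():
        if not name.startswith(genus + " "):
            continue
        score = len(terms & words)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


monkey_text = "Aotus is a nocturnal monkey from Panama with large eyes."
plant_text = "Aotus is a flowering plant genus; this legume grows in Australia."
print(disambiguate("Aotus", monkey_text))
print(disambiguate("Aotus", plant_text))
```

A real pipeline would also use the authority string and the species epithet, as the slide suggests; the overlap count only covers the contextual-data part.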

Beautiful Soup

GNRD (Global Names Recognition and Discovery)

Resolver

• Common names
• Interaction type

Python NLTK
• http://nltk.org/book/
• http://nltk.org/
• Install NLTK and NLTK Data

• Natural Language Processing (NLP)
• Semantic Statistics
Robin is the name of several fictional
characters appearing in comic
books published by DC Comics,
originally created by Bob Kane, Bill
Finger and Jerry Robinson, as a junior
counterpart to DC
Comics superhero Batman. The team of
Batman and Robin is commonly
referred to as the Dynamic Duo or
the Caped Crusaders.

The American Robin is active mostly
during the day and assembles in large
flocks at night. It is one of the earliest
bird species to lay eggs, beginning to
breed shortly after returning to its
summer range from its winter range. Its
nest consists of long coarse grass,
twigs, paper, and feathers, and is
smeared with mud and often
cushioned with grass or other soft
materials. It is among the first birds to
sing at dawn.

Terms that signal the comic-book character:
• fictional
• comic books
• Bob Kane
• superhero
• Batman
• Dynamic Duo
• Caped Crusaders

Terms that signal the bird:
• flocks
• bird
• eggs
• nest
• sing
• species
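One rough way to surface such signal terms automatically is to count the content words in each passage and keep those that the other passage lacks. The sketch below uses only the standard library; the stopword list is a small illustrative assumption (NLTK Data provides a fuller one), and `content_terms` and `distinguishing_terms` are hypothetical helper names.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption, not NLTK's list).
STOPWORDS = {
    "the", "is", "of", "in", "by", "as", "a", "to", "and", "or", "it",
    "its", "at", "with", "from", "one", "other", "often", "among", "after",
}

comic_text = (
    "Robin is the name of several fictional characters appearing in comic "
    "books published by DC Comics, originally created by Bob Kane, Bill "
    "Finger and Jerry Robinson, as a junior counterpart to DC Comics "
    "superhero Batman. The team of Batman and Robin is commonly referred "
    "to as the Dynamic Duo or the Caped Crusaders."
)
bird_text = (
    "The American Robin is active mostly during the day and assembles in "
    "large flocks at night. It is one of the earliest bird species to lay "
    "eggs, beginning to breed shortly after returning to its summer range "
    "from its winter range. Its nest consists of long coarse grass, twigs, "
    "paper, and feathers, and is smeared with mud and often cushioned with "
    "grass or other soft materials. It is among the first birds to sing at "
    "dawn."
)


def content_terms(text):
    """Count lowercase alphabetic words that are not stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def distinguishing_terms(text, other_text):
    """Return the content words of text that never occur in other_text."""
    mine, theirs = content_terms(text), content_terms(other_text)
    return sorted(w for w in mine if w not in theirs)


comic_only = distinguishing_terms(comic_text, bird_text)
bird_only = distinguishing_terms(bird_text, comic_text)
print(comic_only)
print(bird_only)
```

The shared word "robin" drops out of both lists, which is the point: the ambiguous name itself carries no signal, while the words around it do.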

From GNRD
name_list = ["Pandarus sinuatus", "Pandarus smithii"]
genera = []
for name in name_list:
    # The genus is the first word of each scientific name.
    row = name.split(' ')
    genera.append(row[0])
# genera == ["Pandarus", "Pandarus"]

genera = ["Pandarus", "Pandarus"]
i = -1
genus_index_list = []
for genus in genera:
    # Search only the tokens after the previous match.
    genus_text = tokens[i+1:]
    # Convert the position within the slice back to a position in tokens.
    genus_index = genus_text.index(genus) + i + 1
    genus_index_list.append(genus_index)
    i = genus_index
# genus_index_list == [36, 39]

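genus_index_list = [36, 39]
# Work backwards so that merging two tokens into one does not shift
# the indices that are still waiting to be processed.
for counter in reversed(range(len(genus_index_list))):
    index = genus_index_list[counter]
    # Join the genus to the word immediately following.
    species = ' '.join(tokens[index:index+2])
    # Does this match the name_list?
    if species == name_list[counter]:
        # If yes, combine the two into one element
        tokens[index:index+2] = [species]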

tokens =
['Great', 'white', 'sharks', 'are', 'apex', 'predators', ',', 'meaning',
'they', 'have', 'a', 'large', 'effect', 'on', 'the', 'populations', 'of',
'their', 'prey', 'including', 'elephant', 'seals', 'and', 'sea', 'lions.',
'Great', 'white', 'sharks', 'are', 'hosts', 'to', 'parasites', 'such', 'as',
'copepods', '(', 'Pandarus sinuatus', 'and', 'Pandarus smithii', ')', '.']
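The name-merging steps on the preceding slides can be collected into one self-contained function. This is a sketch rather than the code from the talk: `merge_names` is a hypothetical name, the input is assumed to be tokenized by whitespace, and the names are assumed to appear in the text in the same order as in `name_list`.

```python
def merge_names(tokens, name_list):
    """Merge each multi-word name into a single token.

    Returns the new token list and the index of each merged name in it.
    Names are searched left to right, each one after the previous match.
    """
    merged = list(tokens)  # work on a copy so the input is left untouched
    name_index_list = []
    search_from = 0
    for name in name_list:
        parts = name.split(" ")
        last_start = len(merged) - len(parts)
        for i in range(search_from, last_start + 1):
            if merged[i:i + len(parts)] == parts:
                merged[i:i + len(parts)] = [name]
                name_index_list.append(i)
                search_from = i + 1
                break
    return merged, name_index_list


text = ("Great white sharks are apex predators , meaning they have a large "
        "effect on the populations of their prey including elephant seals "
        "and sea lions. Great white sharks are hosts to parasites such as "
        "copepods ( Pandarus sinuatus and Pandarus smithii ) .")
tokens = text.split()
merged, name_index_list = merge_names(
    tokens, ["Pandarus sinuatus", "Pandarus smithii"])
print(name_index_list)
```

Because each merge happens before the next search starts, the returned indices already refer to positions in the merged list, so no index correction is needed afterwards.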

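name_index_list = [36, 38]
Looking at the first relationship:
Carcharodon carcharias
Pandarus sinuatus

# First relationship only: up to 10 tokens on either side of the first name.
name_index = name_index_list[0]
# max() stops a negative start from wrapping around to the end of the list.
start = max(name_index - 10, 0)
term_list = tokens[start:name_index+10]
term_list =
['white', 'sharks', 'are', 'hosts', 'to', 'parasites', 'such', 'as',
'copepods', '(', 'Pandarus sinuatus', 'and', 'Pandarus smithii', ')', '.']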

Relationship type: Parasite/host
Carcharodon carcharias (host)
Pandarus sinuatus (parasite)
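To look at every relationship rather than only the first, the same context window can be taken around each name. A minimal sketch, assuming the names have already been merged into single tokens; `context_windows` is a hypothetical helper, and the short token list below is just the second sentence of the shark example.

```python
def context_windows(tokens, name_index_list, size=10):
    """Map each name to the tokens within `size` positions of it."""
    windows = {}
    for name_index in name_index_list:
        # Clamp the start at 0: a negative slice start would count
        # from the end of the list instead of the beginning.
        start = max(name_index - size, 0)
        windows[tokens[name_index]] = tokens[start:name_index + size]
    return windows


tokens = ['Great', 'white', 'sharks', 'are', 'hosts', 'to', 'parasites',
          'such', 'as', 'copepods', '(', 'Pandarus sinuatus', 'and',
          'Pandarus smithii', ')', '.']
windows = context_windows(tokens, [11, 13])
for name, window in windows.items():
    print(name, window)
```

Each window is the word list that would later be handed to a classifier to decide the relationship type.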

Training Data
• Show the algorithm what “parasite/host”
words look like
• Compare to an unknown
• We want “Document Classification”
• NLTK includes categorized corpora: Brown, Reuters and Movie Reviews
• We need to make our own corpus

Creating a Categorized Text Corpus
• http://www.packtpub.com/article/python-text-processing-nltk-20-creating-custom-corpora
• Inside the “corpora” folder, create a new folder for
your corpus. Mine is “eco”.
• Build your corpus (start with EOL text)
• Make a category specification
• Let's start with parasitism and predation

Creating a Categorized Text Corpus
• eco
– lion1
– lion2
– lion3
– shark1
– shark2
– shark3
– …
– cats.txt

• in cats.txt
lion1.txt predation
lion2.txt parasitism
…
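The folder layout and the `cats.txt` format above can be generated and checked with a few lines of standard-library Python. This is a sketch under assumptions: `build_corpus` and `read_categories` are hypothetical helpers, the corpus is written to a temporary directory rather than to `nltk_data`, and the two document texts are placeholders, not EOL content.

```python
import os
import tempfile


def build_corpus(root, documents):
    """Write one text file per document plus a cats.txt category file.

    `documents` maps a file name to a (category, text) pair.
    """
    os.makedirs(root, exist_ok=True)
    lines = []
    for fileid, (category, text) in sorted(documents.items()):
        with open(os.path.join(root, fileid), "w", encoding="utf-8") as handle:
            handle.write(text)
        # One line per file: file name, a space, then its category.
        lines.append(f"{fileid} {category}")
    with open(os.path.join(root, "cats.txt"), "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")


def read_categories(root):
    """Parse cats.txt into a {file name: [categories]} mapping."""
    mapping = {}
    with open(os.path.join(root, "cats.txt"), encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if parts:
                mapping[parts[0]] = parts[1:]
    return mapping


# Placeholder texts for illustration only.
documents = {
    "lion1.txt": ("predation", "Lions hunt zebra and wildebeest."),
    "lion2.txt": ("parasitism", "Lions are hosts to ticks and tapeworms."),
}
corpus_root = os.path.join(tempfile.mkdtemp(), "eco")
build_corpus(corpus_root, documents)
categories = read_categories(corpus_root)
print(categories)
```

Reading the file back is a cheap sanity check that every document has a category before the corpus is handed to a corpus reader.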

Choose a Corpus Reader
You have to tell this Corpus Reader:
• Corpus root directory
• File names (aka fileids)
• Category specification

from nltk.corpus.reader import CategorizedPlaintextCorpusReader
corpus_root = '/Users/athessen/nltk_data/corpora/eco'
# fileids pattern: lion<N>.txt or shark<N>.txt (cats.txt itself is not matched)
reader = CategorizedPlaintextCorpusReader(corpus_root, r'(lion|shark)\d*\.txt', cat_file='cats.txt')

Next Steps
• Build corpus
• Build Feature Extractor
• Train Classifier
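The last two steps can be sketched without any external dependency. In practice NLTK's own `nltk.NaiveBayesClassifier` would be the natural choice; the `TinyNaiveBayes` class below is a hypothetical standard-library stand-in (multinomial naive Bayes with add-one smoothing), and the four training sentences are toy examples invented for illustration, not EOL text.

```python
import math
from collections import Counter, defaultdict


def document_features(words):
    """Feature extractor: bag-of-words counts over lowercase alphabetic tokens."""
    return Counter(w.lower() for w in words if w.isalpha())


class TinyNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def train(self, labeled_docs):
        for words, label in labeled_docs:
            features = document_features(words)
            self.word_counts[label].update(features)
            self.doc_counts[label] += 1
            self.vocabulary.update(features)

    def classify(self, words):
        features = document_features(words)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in sorted(self.doc_counts):
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus the log likelihood of each known word.
            score = math.log(self.doc_counts[label] / total_docs)
            for word, n in features.items():
                if word in self.vocabulary:  # skip words never seen in training
                    score += n * math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this sketch.
training_docs = [
    ("sharks are hosts to parasites such as copepods".split(), "parasitism"),
    ("ticks are parasites that feed on the blood of their hosts".split(),
     "parasitism"),
    ("lions hunt and kill zebra as prey".split(), "predation"),
    ("sharks are apex predators that eat seals".split(), "predation"),
]
classifier = TinyNaiveBayes()
classifier.train(training_docs)
print(classifier.classify("tapeworms are parasites of many hosts".split()))
print(classifier.classify("wolves hunt deer as prey".split()))
```

With a real corpus, the training pairs would come from the categorized corpus reader (the words of each file paired with its category) instead of hand-written sentences.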
