This presentation demonstrates the potential for NLTK to extract information about ecological species interactions from text in EOL. It was presented Nov 12, 2013 at the Startup Institute in Cambridge, MA for the Boston PyLadies monthly meeting.
13. • Natural Language Processing (NLP)
• Semantic Statistics
Robin is the name of several fictional
characters appearing in comic
books published by DC Comics,
originally created by Bob Kane, Bill
Finger and Jerry Robinson, as a junior
counterpart to DC
Comics superhero Batman. The team of
Batman and Robin is commonly
referred to as the Dynamic Duo or
the Caped Crusaders.
The American Robin is active mostly
during the day and assembles in large
flocks at night. It is one of the earliest
bird species to lay eggs, beginning to
breed shortly after returning to its
summer range from its winter range. Its
nest consists of long coarse grass,
twigs, paper, and feathers, and is
smeared with mud and often
cushioned with grass or other soft
materials. It is among the first birds to
sing at dawn.
14. Robin is the name of several fictional
characters appearing in comic
books published by DC Comics,
originally created by Bob Kane, Bill
Finger and Jerry Robinson, as a junior
counterpart to DC
Comics superhero Batman. The team of
Batman and Robin is commonly
referred to as the Dynamic Duo or
the Caped Crusaders.
• fictional
• comic books
• Bob Kane
• superhero
• Batman
• Dynamic Duo
• Caped Crusaders
The American Robin is active mostly
during the day and assembles in large
flocks at night. It is one of the earliest
bird species to lay eggs, beginning to
breed shortly after returning to its
summer range from its winter range. Its
nest consists of long coarse grass,
twigs, paper, and feathers, and is
smeared with mud and often
cushioned with grass or other soft
materials. It is among the first birds to
sing at dawn.
• flocks
• bird
• eggs
• nest
• sing
• species
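The two keyword lists hint at a simple way to tell the senses of "Robin" apart: count how many cue words from each list occur in a passage. Below is a minimal sketch of that idea, not the presenter's method. The function name `guess_sense`, the single-word cue sets, and the crude regex tokenization are assumptions made for illustration.

```python
import re

# Cue words taken from the two keyword lists on the slide,
# split into single lowercase words for easy matching.
SENSE_CUES = {
    "comic character": {"fictional", "comic", "books", "bob", "kane",
                        "superhero", "batman", "dynamic", "duo",
                        "caped", "crusaders"},
    "bird": {"flocks", "bird", "eggs", "nest", "sing", "species"},
}


def guess_sense(text):
    """Return (best_sense, scores) based on cue-word overlap."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    # On a tie, max() keeps the first sense in SENSE_CUES.
    return max(scores, key=scores.get), scores


sense, scores = guess_sense(
    "It is one of the earliest bird species to lay eggs."
)
print(sense, scores)
```

Counting overlaps like this ignores word order and frequency, so it only works when the two contexts share few words, as they do here.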
18. From GNRD
name_list = ["Pandarus sinuatus", "Pandarus smithii"]
genera = []
for name in name_list:
    # The genus is the first word of each binomial name.
    row = name.split(' ')
    genera.append(row[0])
# genera == ["Pandarus", "Pandarus"]
19. genera = ["Pandarus", "Pandarus"]
i = -1
genus_index_list = []
for genus in genera:
    # Search only the tokens after the previous match.
    genus_text = tokens[i+1:]
    # Convert the position within the slice back to a position in tokens.
    genus_index = genus_text.index(genus) + i + 1
    genus_index_list.append(genus_index)
    i = genus_index
# genus_index_list == [36, 39]
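20. genus_index_list = [36, 39]
name_index_list = []
for counter, index in enumerate(genus_index_list):
    # Each earlier merge shortened tokens by one, so shift the index left.
    index = index - len(name_index_list)
    # Join the genus to the word immediately following.
    species = ' '.join(tokens[index:index+2])
    # Does this match the name_list?
    if species == name_list[counter]:
        # If yes, combine the two into one element.
        tokens[index:index+2] = [species]
        name_index_list.append(index)
# name_index_list == [36, 38]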
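The genus lookup and the merge step can also be combined into one pass. The sketch below is one possible arrangement, not the presenter's code: the helper name `merge_binomials` is hypothetical, and the example sentence is a shortened, whitespace-tokenized stand-in for the EOL text, so the indices differ from the 36 and 38 on the slides.

```python
def merge_binomials(tokens, name_list):
    """Merge 'Genus species' token pairs into single tokens.

    Returns the new token list and the index of each merged name.
    Assumes names appear in the text in the same order as name_list.
    """
    tokens = list(tokens)  # work on a copy
    name_index_list = []
    search_from = 0
    for name in name_list:
        genus = name.split(" ")[0]
        try:
            index = tokens.index(genus, search_from)
        except ValueError:
            continue  # genus not found after the previous match
        # Only merge when genus plus the next token spells the full name.
        if " ".join(tokens[index:index + 2]) == name:
            tokens[index:index + 2] = [name]
            name_index_list.append(index)
        search_from = index + 1
    return tokens, name_index_list


sentence = ("white sharks are hosts to parasites such as copepods "
            "( Pandarus sinuatus and Pandarus smithii ) .")
tokens, name_index_list = merge_binomials(
    sentence.split(), ["Pandarus sinuatus", "Pandarus smithii"]
)
print(name_index_list)
print(tokens)
```

Because each search starts from the current (already merged) token list, no index correction is needed after a merge.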
22. name_index_list = [36, 38]
Looking at the first relationship:
Carcharodon carcharias
Pandarus sinuatus
# Take the first name and up to ten tokens on either side of it.
name_index = name_index_list[0]
term_list = tokens[name_index-10:name_index+10]
term_list =
['white', 'sharks', 'are', 'hosts', 'to', 'parasites', 'such', 'as',
'copepods', '(', 'Pandarus sinuatus', 'and', 'Pandarus smithii', ')', '.']
23. Looking at the first relationship:
Parasite/host
Carcharodon carcharias
Pandarus sinuatus
term_list =
['white', 'sharks', 'are', 'hosts', 'to', 'parasites', 'such', 'as',
'copepods', '(', 'Pandarus sinuatus', 'and', 'Pandarus smithii', ')', '.']
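The window of terms around a name can be wrapped in a small helper. This is a sketch only: the name `context_window` and the `width` parameter are assumptions. It also clamps the start of the slice at zero, because a negative start index in Python would wrap around to the end of the list when a name sits near the beginning of the text.

```python
def context_window(tokens, name_index, width=10):
    """Return up to `width` tokens before the name, the name itself,
    and the tokens after it, up to index name_index + width."""
    start = max(0, name_index - width)  # avoid negative-index wraparound
    return tokens[start:name_index + width]


tokens = ['white', 'sharks', 'are', 'hosts', 'to', 'parasites', 'such', 'as',
          'copepods', '(', 'Pandarus sinuatus', 'and', 'Pandarus smithii',
          ')', '.']

term_list = context_window(tokens, tokens.index('Pandarus sinuatus'))
narrow = context_window(tokens, tokens.index('Pandarus smithii'), width=2)
print(term_list)
print(narrow)
```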
24. Training Data
• Show the algorithm what “parasite/host” words look like
• Compare to an unknown
• We want “Document Classification”
• Existing categorized corpora: Brown, Reuters and Movie Reviews
• We need to make our own corpus
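To make "show the algorithm, then compare to an unknown" concrete, here is a minimal sketch using NLTK's `NaiveBayesClassifier`. The choice of Naive Bayes, the `word_features` helper, and the four tiny training examples are all assumptions made up for illustration; they are not the presenter's data or necessarily the classifier used in the talk.

```python
from nltk.classify import NaiveBayesClassifier


def word_features(words):
    """Bag-of-words features: each lowercase word maps to True."""
    return {word.lower(): True for word in words}


# Toy, hand-written training examples (hypothetical, for illustration only).
training_data = [
    (word_features(["hosts", "parasites", "copepods"]), "parasitism"),
    (word_features(["parasite", "infects", "host"]), "parasitism"),
    (word_features(["lion", "hunts", "prey"]), "predation"),
    (word_features(["predator", "eats", "prey"]), "predation"),
]

classifier = NaiveBayesClassifier.train(training_data)

# An "unknown" term list to compare against the training data.
unknown = ["white", "sharks", "are", "hosts", "to", "parasites"]
label = classifier.classify(word_features(unknown))
print(label)
```

A real classifier would need far more labeled text per category, which is the motivation for building a custom corpus.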
25. Creating a Categorized Text Corpus
• http://www.packtpub.com/article/pythontext-processing-nltk-20-creating-customcorpora
• Inside the “corpus” folder, create a new folder for your corpus. Mine is “eco”.
• Build your corpus (start with EOL text)
• Make a category specification
• Let's start with parasitism and predation
26. Creating a Categorized Text Corpus
• eco
  – lion1
  – lion2
  – lion3
  – shark1
  – shark2
  – shark3
  – …
  – cats.txt
• in cats.txt
  lion1.txt predation
  lion2.txt parasitism
  …
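The category file is plain text with one "fileid category" pair per line, so it can be generated from a dictionary. The following is a sketch under stated assumptions: the helper `write_category_file` is hypothetical, and a temporary directory stands in for the real "eco" folder.

```python
import tempfile
from pathlib import Path


def write_category_file(corpus_dir, categories, filename="cats.txt"):
    """Write one 'fileid category' line per corpus file."""
    lines = [f"{fileid} {category}"
             for fileid, category in sorted(categories.items())]
    path = Path(corpus_dir) / filename
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    eco = Path(tmp) / "eco"  # stand-in for the real corpus folder
    eco.mkdir()
    cats = write_category_file(
        eco, {"lion1.txt": "predation", "lion2.txt": "parasitism"}
    )
    cats_text = cats.read_text(encoding="utf-8")

print(cats_text)
```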
28. Choose a Corpus Reader
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
corpus_root = '/Users/athessen/nltk_data/corpora/eco'
reader = CategorizedPlaintextCorpusReader(corpus_root, r'(lion|shark)\d*\.txt', cat_file='cats.txt')
29. Choose a Corpus Reader
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
corpus_root = '/Users/athessen/nltk_data/corpora/eco'
reader = CategorizedPlaintextCorpusReader(corpus_root, r'(lion|shark)\d*\.txt', cat_file='cats.txt')
You have to tell this Corpus Reader:
• Corpus root directory
• File names (aka fileids)
• Category specification
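Once the reader has its three arguments, the categories and files can be queried directly. The sketch below builds a throwaway two-file corpus in a temporary directory so it runs anywhere; the file contents are invented placeholders, not EOL text, and the grouped, escaped file pattern is an assumption about the intended regular expression.

```python
import tempfile
from pathlib import Path

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

with tempfile.TemporaryDirectory() as corpus_root:
    root = Path(corpus_root)
    # Placeholder documents (hypothetical text, for illustration only).
    (root / "lion1.txt").write_text("Lions hunt zebra.", encoding="utf-8")
    (root / "shark1.txt").write_text(
        "White sharks are hosts to copepods.", encoding="utf-8"
    )
    # Category specification: one 'fileid category' pair per line.
    (root / "cats.txt").write_text(
        "lion1.txt predation\nshark1.txt parasitism\n", encoding="utf-8"
    )

    reader = CategorizedPlaintextCorpusReader(
        corpus_root,               # corpus root directory
        r"(lion|shark)\d*\.txt",   # file names (fileids) as a regex
        cat_file="cats.txt",       # category specification
    )
    categories = reader.categories()
    predation_files = reader.fileids(categories="predation")
    # Materialize the lazy word view before the directory is removed.
    shark_words = list(reader.words("shark1.txt"))

print(categories)
print(predation_files)
print(shark_words)
```

`words()` uses the reader's default word tokenizer, so no extra NLTK data download should be needed for this step.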