5. Page 5
Introduction (Continued…)
• Information Extraction - Extracting information from text
• Part of Speech Analysis
Ex: BlackBeauty<noun> is<verb> a<det> pretty<adjective> horse<noun>
• Named Entity Extraction
Ex: The CEO <Person>Mr. A</Person> of <Location>New York</Location> based Firm
<Organization>Foo.Inc</Organization> announced its new Product
<date>today</date>
• Sentiment Analysis
Ex: Watch this film. AVATAR is an achievement in many technical departments. It is a
beautiful experience
• Sentence Detection
Ex: <Start Sentence>BlackBeauty is a pretty horse <End Sentence>
• Some Tools: OpenNLP[5], LingPipe[6], GATE[7], NLTK[8], etc.
• Categorization/Classification - Categorize items into one of the predefined
classes
Ex: An article talking about some baseball match is a “Sports” article.
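To make the categorization idea concrete, here is a minimal sketch in Python. It is a toy keyword-overlap scorer, not the machine-learning approach the slides refer to; the class names, keyword lists, and the `categorize` helper are all hypothetical and chosen only for illustration.

```python
# Hypothetical predefined classes, each with a hand-picked keyword set
# (an assumption; a real system would learn these from training data).
CLASS_KEYWORDS = {
    "Sports": {"baseball", "match", "inning", "pitcher"},
    "Finance": {"stock", "market", "earnings", "shares"},
}


def categorize(text, class_keywords=CLASS_KEYWORDS, default="Unknown"):
    """Return the predefined class whose keywords overlap most with the text."""
    tokens = set(text.lower().split())
    # Score each class by the number of its keywords present in the text.
    scores = {label: len(tokens & words) for label, words in class_keywords.items()}
    best = max(scores, key=scores.get)
    # Fall back to the default label when no keyword matched at all.
    return best if scores[best] > 0 else default


label = categorize("The baseball match went to an extra inning")
```

Here `label` is the class with the highest keyword overlap; text that matches no keyword falls back to the default label.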
6. Page 6
Introduction (Continued…)
• Challenges
• Processing large amounts of data
• Most approaches use machine learning methods
• Models need to be trained on large amounts of data
• Need a way to perform the computations in a scalable manner
• Domain Dependency
7. Page 7
Problem Statement
• What do we want to do?
• Build large-scale text processing applications
• Why is this useful?
• Analyze the large volume of content available at AOL
• Applications: user interest mining, ad targeting, personalization, etc.
• We need
• A large-scale NLP system
• A pipeline-style architecture in which users can plug components in or out
• Abstraction or transparency of the underlying algorithms, as requested by the user
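The pluggable-pipeline requirement above can be sketched as follows. This is a minimal illustration under assumed names, not the system's actual design: the `Pipeline` class, its `plug_in` / `plug_out` methods, and the two toy components are hypothetical.

```python
class Pipeline:
    """Runs named components in order; each takes and returns a document dict."""

    def __init__(self):
        self._components = []  # list of (name, callable) pairs, in run order

    def plug_in(self, name, component):
        self._components.append((name, component))
        return self  # allow chaining

    def plug_out(self, name):
        self._components = [(n, c) for n, c in self._components if n != name]
        return self

    def run(self, text):
        doc = {"text": text}
        for _, component in self._components:
            doc = component(doc)
        return doc


def split_sentences(doc):
    # Naive stand-in for a real sentence detector: split on periods.
    doc["sentences"] = [s.strip() for s in doc["text"].split(".") if s.strip()]
    return doc


def count_tokens(doc):
    # Naive whitespace tokenization, used only to show a second component.
    doc["token_count"] = len(doc["text"].split())
    return doc


pipeline = Pipeline().plug_in("sentences", split_sentences).plug_in("tokens", count_tokens)
result = pipeline.run("BlackBeauty is a pretty horse. It runs fast.")
```

Because every component shares the same dict-in, dict-out contract, callers only see the pipeline interface; swapping the naive sentence splitter for a trained model would not change how the pipeline is used.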
8. Page 8
Our Intelligent Text Processing System
• Overview
• Pipelined Architecture
• Pluggable components
• Work Flow Manager
• Recovery Manager
• Job Manager
• Applications
• Large-scale applications built on a scalable way of applying NLP models
10. Page 10
Job Manager
• Creates a series of parallel and sequentially dependent jobs from a configuration file
• Example:
Jobs A, B, C, D, E and F
Job B depends on Job A; Job E depends on Job D
• The Job Manager creates the following tree
• Jobs A, D and F are executed in parallel
• Jobs B and E are executed in parallel, each once its parent job has completed
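The dependency example above can be sketched as a simple wave scheduler. This is an illustration under assumptions, not the Job Manager's actual implementation: the `schedule_in_waves` function and the dict-based dependency format are hypothetical, and Job C, for which the slide states no dependency, is treated here as an independent job.

```python
def schedule_in_waves(jobs, depends_on):
    """Group jobs into waves; all jobs in a wave can run in parallel."""
    # Map each job to the set of parents it still has to wait for.
    remaining = {job: set(depends_on.get(job, ())) for job in jobs}
    done = set()
    waves = []
    while remaining:
        # A job is ready once every one of its parents has completed.
        ready = sorted(job for job, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown parent job")
        waves.append(ready)
        done.update(ready)
        for job in ready:
            del remaining[job]
    return waves


jobs = ["A", "B", "C", "D", "E", "F"]
depends_on = {"B": ["A"], "E": ["D"]}  # B depends on A; E depends on D
waves = schedule_in_waves(jobs, depends_on)
```

The first wave holds the jobs without parents and the second holds B and E. Grouping into waves is a simplification: a scheduler following the slide more closely would start B as soon as A finishes, without waiting for the rest of the first wave.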
11. Page 11
Recovery Manager
• Each job writes its configuration, start time, and end time (if completed) to the status file
• The Recovery Manager periodically checks the status file for updates to see whether any job has failed; if so, it restarts the job by calling the Job Manager with the required configuration
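One pass of the recovery check described above might look like the sketch below. Several details are assumptions, since the slides do not specify them: the status file is taken to be JSON keyed by job name, a job counts as failed when it has a start time but no end time after `timeout_seconds`, and `restart_job` is a hypothetical stand-in for the call into the Job Manager.

```python
import json
import time


def find_failed_jobs(status_path, timeout_seconds, now=None):
    """Return (name, config) for jobs that started but never recorded an end time."""
    now = time.time() if now is None else now
    with open(status_path) as handle:
        statuses = json.load(handle)
    failed = []
    for name, record in statuses.items():
        unfinished = record.get("end_time") is None
        # Assumed failure rule: unfinished for longer than the timeout.
        if unfinished and now - record["start_time"] > timeout_seconds:
            failed.append((name, record["config"]))
    return failed


def recover_once(status_path, timeout_seconds, restart_job, now=None):
    """Run one recovery pass and return the names of the restarted jobs."""
    restarted = []
    for name, config in find_failed_jobs(status_path, timeout_seconds, now):
        restart_job(name, config)  # hypothetical hook into the Job Manager
        restarted.append(name)
    return restarted
```

A periodic check would call `recover_once` in a loop with a sleep between passes; the `now` parameter exists only so the timeout rule can be exercised deterministically.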
21. Page 21
Conclusions
• Pipelined Architecture
• NLP System
• Large Scale Applications
• Location-aware Contextual Ad Targeting
• User-aware Ad Targeting
22. Page 22
Future Work
• Developing distributed algorithms for
• POS Tagger
• Sentiment Analyzer models
• Exploring whether integrating with an open-source distributed ML/TM framework would be useful