AOL - Rao & Uppuluri - Hadoop World 2010

Hadoop Based Intelligent Text Processing System
October 12, 2010
Hadoop World, NYC

Who are we?
• Vaijanath N. Rao
  • AOL
  • Contact: vaijanath.rao@teamaol.com
• Rohini Uppuluri
  • AOL
  • Contact: rohini.uppuluri@teamaol.com

Agenda
1. Introduction
2. Problem Statement
3. Our Intelligent Text Processing System
   1. Overview
   2. Detailed
   3. Application(s)
4. Q and A

Introduction (Continued…)
• Information Extraction - extracting information from text
  • Part of Speech Analysis
    Ex: BlackBeauty<noun> is<verb> a<det> pretty<adjective> horse<noun>
  • Named Entity Extraction
    Ex: The CEO <Person>Mr. A</Person> of <Location>New York</Location> based firm <Organization>Foo.Inc</Organization> announced its new product <date>today</date>
  • Sentiment Analysis
    Ex: Watch this film. AVATAR is an achievement in many technical departments. It is a beautiful experience
  • Sentence Detection
    Ex: <Start Sentence>BlackBeauty is a pretty horse <End Sentence>
  • Some tools: OpenNLP[5], LingPipe[6], GATE[7], NLTK[8], etc.
• Categorization/Classification - categorize items into one of the predefined classes
  Ex: An article talking about a baseball match is a “Sports” article.
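
The categorization bullet above can be illustrated with a toy example. This is only a minimal sketch, assuming a hand-written keyword list per category; the category names, keywords, and the `categorize` function are hypothetical and are not taken from the talk, which notes that most real approaches rely on trained machine learning models.

```python
import re

# Hypothetical keyword lists, for illustration only.
CATEGORY_KEYWORDS = {
    "Sports": {"baseball", "match", "inning", "score"},
    "Finance": {"stock", "market", "earnings", "shares"},
}

def categorize(text):
    """Return the category whose keyword set overlaps most with the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    scores = {name: len(tokens & words) for name, words in CATEGORY_KEYWORDS.items()}
    # On a tie, max() keeps the first category in dictionary order.
    return max(scores, key=scores.get)

label = categorize("An article talking about a baseball match")
```

A trained classifier would replace the keyword overlap with a learned scoring function, but the interface (text in, one predefined class out) stays the same.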

Introduction (Continued…)
• Challenges
  • Processing large amounts of data
  • Most approaches use machine learning methods
  • Models need to be trained on large amounts of data
  • Need a way to perform the computations in a scalable manner
  • Domain dependency

Problem Statement
• What do we want to do?
  • Build large-scale text processing applications
• Why is this useful?
  • Analyze the large volume of content available at AOL
  • Applications: user interest mining, ad targeting, personalization, etc.
• We need
  • A large-scale NLP system
  • A pipeline-style architecture in which users can plug components in or out
  • Abstraction or transparency of the algorithms used, as requested by the user
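
The pluggable pipeline requirement above could look roughly like the following. This is a minimal single-machine sketch, not the system described in the talk: the `Pipeline` class, its `plug_in`/`plug_out` methods, and the two toy components are all hypothetical names chosen for illustration.

```python
class Pipeline:
    """Ordered list of named components; each takes a document dict and returns one."""

    def __init__(self):
        self._components = []

    def plug_in(self, name, component):
        # Append a component to the end of the pipeline.
        self._components.append((name, component))

    def plug_out(self, name):
        # Remove every component registered under this name.
        self._components = [(n, c) for n, c in self._components if n != name]

    def run(self, doc):
        # Pass the document through each component in order.
        for _, component in self._components:
            doc = component(doc)
        return doc

def tokenize(doc):
    return {**doc, "tokens": doc["text"].split()}

def count_tokens(doc):
    return {**doc, "n_tokens": len(doc.get("tokens", []))}

pipeline = Pipeline()
pipeline.plug_in("tokenizer", tokenize)
pipeline.plug_in("counter", count_tokens)
result = pipeline.run({"text": "BlackBeauty is a pretty horse"})

pipeline.plug_out("counter")
shorter = pipeline.run({"text": "BlackBeauty is a pretty horse"})
```

Because every component shares one input and output shape, callers only see the pipeline interface and do not need to know which algorithm sits behind each name.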

Our Intelligent Text Processing System
• Overview
  • Pipelined architecture
  • Pluggable components
  • Work Flow Manager
  • Recovery Manager
  • Job Manager
• Applications
  • Large-scale applications that apply NLP models in a scalable way

Job Manager
• Creates a series of parallel and sequentially dependent jobs (takes a configuration file)
• Example:
  Jobs A, B, C, D, E and F
  Job B depends on Job A; Job E depends on Job D
• Job manager creates the following tree
  • Jobs A, D and F are executed in parallel
  • Jobs B and E are executed in parallel once their respective parent jobs complete
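
The scheduling idea in the example above can be sketched as grouping jobs into waves, where every job in a wave has all of its parents in earlier waves. This is an illustrative sketch only; the `execution_waves` function is hypothetical. The slide does not state any dependency for Job C, so the sketch assumes C is independent, which places it in the first wave alongside A, D and F.

```python
def execution_waves(jobs, depends_on):
    """Group jobs into waves; jobs within one wave can run in parallel."""
    remaining = set(jobs)
    done = set()
    waves = []
    while remaining:
        # A job is ready when all of its parents have already completed.
        ready = sorted(j for j in remaining if set(depends_on.get(j, [])) <= done)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies: %s" % sorted(remaining))
        waves.append(ready)
        done.update(ready)
        remaining.difference_update(ready)
    return waves

jobs = ["A", "B", "C", "D", "E", "F"]
# Dependencies from the example; Job C is assumed to have none.
depends_on = {"B": ["A"], "E": ["D"]}
waves = execution_waves(jobs, depends_on)
```

Waves are a simplification: B waits for the whole first wave here, whereas a scheduler that tracks each parent separately could start B as soon as A finishes.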

Recovery Manager
• Each job writes its configuration, start time, and end time (if completed) into the status file
• The recovery manager periodically checks the status file for updates to see whether any job has failed; if so, it restarts the job by calling the job manager with the required configuration
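
One way to picture the recovery check described above is the sketch below. The slides do not give the status file format or say how a failure is detected, so this sketch makes two assumptions: the status file holds one JSON record per line, and a job that has a start time but no end time after a timeout is treated as failed. The function names and the `restart_job` callback (standing in for the call to the job manager) are hypothetical.

```python
import json

def jobs_to_restart(status_lines, now, timeout_seconds):
    """Return configs of jobs with no end time that have exceeded the timeout."""
    latest = {}
    for line in status_lines:
        record = json.loads(line)
        # A later record for the same job overrides an earlier one.
        latest[record["config"]["name"]] = record
    return [
        record["config"]
        for record in latest.values()
        if record.get("end_time") is None
        and now - record["start_time"] > timeout_seconds
    ]

def recover(status_lines, now, timeout_seconds, restart_job):
    """Call restart_job with the saved configuration of each failed job."""
    restarted = []
    for config in jobs_to_restart(status_lines, now, timeout_seconds):
        restart_job(config)
        restarted.append(config["name"])
    return restarted

# Assumed status records: times are seconds since an arbitrary epoch.
status_lines = [
    json.dumps({"config": {"name": "postagger"}, "start_time": 0, "end_time": 120}),
    json.dumps({"config": {"name": "nounphrase"}, "start_time": 130, "end_time": None}),
]
restart_calls = []
restarted = recover(status_lines, now=4000, timeout_seconds=3600,
                    restart_job=restart_calls.append)
```

Storing the full configuration in each record is what lets the restart happen without consulting any other state.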

Sample Configuration
<job name="keyphrase">
  <mapreduce depends="none" name="postagger">
    <inputargs>input arguments as string</inputargs>
    <output>$hdfsoutputLocation</output>
    <jar>postagger.jar</jar>
    <mainClass>com.aol.datalayer.nlp.postagger</mainClass>
  </mapreduce>
  <mapreduce depends="postagger" name="nounphrase">
    <inputargs>input arguments as string</inputargs>
    <output>$hdfsoutputlocation</output>
    <jar>chunker.jar</jar>
    <mainClass>com.aol.datalayer.nlp.chunker</mainClass>
  </mapreduce>
</job>
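
To show how a configuration like the one above could be turned into a dependency map, here is a small sketch using Python's standard `xml.etree.ElementTree`. The talk does not show the actual parser; `load_dependencies` is a hypothetical helper, and treating `depends` as a comma-separated list is an assumption (the sample only shows `none` and a single job name).

```python
import xml.etree.ElementTree as ET

CONFIG = """<job name="keyphrase">
  <mapreduce depends="none" name="postagger">
    <inputargs>input arguments as string</inputargs>
    <output>$hdfsoutputLocation</output>
    <jar>postagger.jar</jar>
    <mainClass>com.aol.datalayer.nlp.postagger</mainClass>
  </mapreduce>
  <mapreduce depends="postagger" name="nounphrase">
    <inputargs>input arguments as string</inputargs>
    <output>$hdfsoutputlocation</output>
    <jar>chunker.jar</jar>
    <mainClass>com.aol.datalayer.nlp.chunker</mainClass>
  </mapreduce>
</job>"""

def load_dependencies(xml_text):
    """Return the job name and a map from each step to the steps it depends on."""
    root = ET.fromstring(xml_text)
    deps = {}
    for step in root.findall("mapreduce"):
        depends = step.get("depends", "none")
        # Assumption: several parents would be written as a comma-separated list.
        parents = [] if depends == "none" else [d.strip() for d in depends.split(",")]
        deps[step.get("name")] = parents
    return root.get("name"), deps

job_name, deps = load_dependencies(CONFIG)
```

Here `deps` maps `postagger` to no parents and `nounphrase` to `postagger`, which is the shape a job manager would need to order the two MapReduce steps.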

Application 1 - Location Aware Contextual Advertising - Example

Location Aware Contextual Advertising - Overview

Application 2 - User Aware Ad Targeting - Example
This is an illustrative example and does not represent any real user

User Aware Ad Targeting

Conclusions
• Pipelined architecture
• NLP system
• Large-scale applications
  • Location aware contextual ad targeting
  • User aware ad targeting

Future Work
• Developing distributed algorithms for
  • POS tagger
  • Sentiment analyzer models
• Exploring whether integrating with an open source distributed ML/TM framework would be useful

References
1. Part-of-Speech Tagging: en.wikipedia.org/wiki/Part-of-speech_tagging
2. Coreference Resolution: en.wikipedia.org/wiki/Coreference
3. Named Entity Recognition: en.wikipedia.org/wiki/Named_entity_recognition
4. Sentiment Analysis: en.wikipedia.org/wiki/Sentiment_analysis
5. OpenNLP: http://opennlp.sourceforge.net/
6. LingPipe: http://alias-i.com/lingpipe/
7. GATE: http://gate.ac.uk/ie/
8. NLTK: www.nltk.org
