Experiments in data mining, entity disambiguation
and how to think data-structures for designing
beautiful algorithms
Ekta Grover
1st September, 2013
@ektagrover/ekta1007@gmail.com PyCon India, 2013 1/26
Structure of this Talk
How to think through a data mining problem
3 problems in data-mining around disambiguation, Natural
language processing & Information Retrieval
Measuring your Model performance - developing a near-real
time performance aggregation platform
Putting it all together
The only algorithm you will ever need to know
1 Write down the problem
2 Think real hard
3 Write down the solution
Thinking through a Data-Mining problem
What is the data and how should you represent it?
What are the relationships you are interested in?
How do you define your objective function?
How will you know that the score represents the direction you want to capture?
How will you measure the performance after you are done?
Problem # 1 : Entity Disambiguation
Problem
So, How much is your network really worth ?
What is Disambiguation ?
Information = Related + Unrelated
Extract significant attributes/features of an entity
Relationships between entities & features
The objective function is just a way to measure: it ranks/scores the features of an entity with respect to other entities
Disambiguation is not a deterministic problem
Problem # 1 : Entity Disambiguation
Guiding question #1
What form of disambiguation to use for the entities ?
Algorithm & approach
1 Preprocessing - lower casing & removing punctuation
2 Exploratory analysis - frequency plots (with NLTK FreqDist & PrettyTable)
3 Tokenization to find "keywords" to disambiguate
4 Handling stop words & internationalization
5 Frequent item-set mining with support = # of buckets
6 Post-processing, frequency plots & visualization
Extensions: handling typographical errors and augmenting this data-set with scraping
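The preprocessing, tokenization and frequency-count steps above can be sketched in a few lines. This is a minimal illustration using only the standard library rather than NLTK; the stop-word list and the sample names are assumptions made up for the example, not data from the talk.

```python
import string
from collections import Counter

# Hypothetical stop-word list for company names; a real list would be larger.
STOP_WORDS = {"pvt", "ltd", "private", "limited", "inc", "the"}

def preprocess(name):
    """Lower-case a raw entity name and replace punctuation with spaces."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return name.lower().translate(table)

def tokenize(name):
    """Split a preprocessed name into tokens, dropping stop words."""
    return [tok for tok in preprocess(name).split() if tok not in STOP_WORDS]

def frequency_table(names):
    """Count how often each normalized name occurs."""
    return Counter(" ".join(tokenize(name)) for name in names)

# Made-up sample input for illustration only.
names = ["Dell Inc.", "dell", "DELL Pvt. Ltd", "Google", "google inc"]
counts = frequency_table(names)
for entity, freq in counts.most_common():
    print(entity, freq)
```

Replacing punctuation with spaces (instead of deleting it) is one possible choice; it makes "Ebay/Paypal" split into two tokens.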
Problem # 1 : Talking Specifics & answering the "Why"
Frequent item-set mining & concept of support
How do we know the number of significant words in an entity name?
1 For every company name, strip the 1st word and store it as the key of a "local" dictionary, with the values as a list
2 On the local dict created in step 1, do frequent item-set mining with support = # baskets: if a word is significant (and uniquely describes an entity), it should be present in all baskets
3 Treat all the values in a local dict as a set: if all elements of a local dict are the same, there is no need for item-set mining
4 For every key, find the smallest list of tokens (by length) and search for the presence of each of its tokens iteratively until the support falls below # baskets
5 Replace all values in that dictionary with the words that met the support
6 Fetch all values from all keys in the dict of lists and flatten this list
7 Run regular frequency counts on this flattened list of now disambiguated elements
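One way to read the seven steps above is sketched below. It is an interpretation under stated assumptions, not the code from the talk's repository: each name's token list is treated as a basket, the helper names are hypothetical, and the sample names are invented for the example.

```python
from collections import Counter, defaultdict

def group_by_first_word(names):
    """Step 1: key each name by its first token; each token list is a basket."""
    local = defaultdict(list)
    for name in names:
        tokens = name.lower().split()
        if tokens:
            local[tokens[0]].append(tokens)
    return local

def significant_tokens(baskets):
    """Steps 2-4: keep the leading tokens whose support equals the basket count."""
    if all(basket == baskets[0] for basket in baskets):
        return baskets[0]  # identical values, no mining needed
    shortest = min(baskets, key=len)
    kept = []
    for token in shortest:
        support = sum(token in basket for basket in baskets)
        if support < len(baskets):  # support fell below # baskets, stop here
            break
        kept.append(token)
    return kept

def disambiguate(names):
    """Steps 5-7: replace every variant by its canonical form and count."""
    counts = Counter()
    for baskets in group_by_first_word(names).values():
        canonical = " ".join(significant_tokens(baskets))
        counts[canonical] += len(baskets)
    return counts

# Made-up sample input for illustration only.
names = [
    "Microsoft Corporation",
    "Microsoft Corporation India",
    "Microsoft",
    "Boston Consulting Group",
    "Boston Consulting Group India",
]
result = disambiguate(names)
print(result)
```

Because the key is always the first token, that token has full support by construction, so every group keeps at least one word.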
Problem # 1 : Limitations of this approach
1 Abbreviations: Boston Consulting Group & BCG, State Bank of India & SBI - build a domain-specific Word Sense Disambiguation (WSD) reference database
2 Differently referenced entities such as "Innovation labs 247 customer private ltd" and "247 innovation labs": the referencing key will be different for these
3 Sub-entities of a company with punctuation, e.g. "Ebay/Paypal": the "key" depends on how we choose to treat the punctuation {/,:,-}
4 Need a module to handle inadvertent typographical errors in entity names; names are proper nouns, so they cannot be looked up in the WordNet database - plug in Problem #2
Extensions
Cross-document and co-reference resolution, multi-pass learning, smoothing & annealed learning algorithms
Problem # 1 : Representation
Problem # 1 : Representation, a more extreme example &
Codewalk
Codewalk from Github
https://github.com/ekta1007/Experiments-in-Data-mining
Problem # 1 : Visualization
[Figure: node plot of companies labelled with frequency counts. Legible labels: SAP [57], [24]7, Inc [37], Ebay inc [8], Intuit [8], Microsoft Corporation [8], Accenture [7], Linkedin [7], University of Cologne [7], Dell [6], Deutsche Bank [6], Flipkart.com [6], Boston Consulting Group [6], Zynga [6], Amazon [5], Anita Borg Institute for Women & Technology [5]. Also shown: VMware, Ibm, Google, Thoughtworks, Yahoo]
Top companies by Frequency distribution, post item-set mining
*I used R (ggplot2 and igraph) to plot the nodes with a ColorBrewer palette, and Inkscape for "cosmetics"
Problem # 2 : Custom Distance metric for handling
Typographical errors optimized for a QWERTY keyboard
Usecase
Disambiguating typographical errors against a contrasting (noun) item
Three types of misspellings [1]
1 Typographic errors - the writer knows the correct spelling, but motor coordination slips when typing
2 Cognitive errors - caused by the writer's lack of knowledge
3 Phonetic errors [2] - sound correct, but are orthographically incorrect
Distance measures: Levenshtein, Manhattan, Euclidean, Cosine - inappropriate for distance in our context. A hybrid, contextual approach is needed
[1] Karen Kukich, "Techniques for automatically correcting words in text", ACM Computing Surveys, Volume 24, Issue 4, Dec. 1992; Creating a Spellchecker with Solr, http://emmaespina.wordpress.com/2011/01/31/20/
[2] Soundex algorithm; example: Chebyshev & Tchebycheff
Problem # 2 : QWERTY distance metric
Guiding Assumption
Since typographical errors are inadvertently introduced, they should not
count towards the overall distance
Algorithm & approach
1 Data structures - QWERTY keyboard as a graph with connected adjacent nodes, and neighborhood nodes with appropriate distances
2 Define a typographical error & its thresholds - what can be the difference in length of the words, and how many possible "swaps" can occur in a typo
3 Implement the custom distance metric - fall back to Levenshtein when the difference is non-typographical (hybrid approach)
4 Output whether the words are identical, or the distance taking typographical errors into account
5 Contrast and blend in with other fuzzy logic methods, Apache Solr, spell checker & auto-suggest
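A minimal sketch of the hybrid idea follows, under simplifying assumptions that are not from the talk: the keyboard is modelled as an unstaggered grid of the three letter rows, a "typo" is a same-length word whose mismatched characters are all on neighbouring keys, and the threshold `max_typos` is a hypothetical parameter. The author's repository may define these differently.

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def build_keyboard_graph(rows=QWERTY_ROWS):
    """Map each key to the set of keys next to it on a simplified grid."""
    graph = {key: set() for row in rows for key in row}
    for r, row in enumerate(rows):
        for c, key in enumerate(row):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < len(rows) and 0 <= cc < len(rows[rr])
                    if (dr, dc) != (0, 0) and inside:
                        graph[key].add(rows[rr][cc])
    return graph

KEYBOARD = build_keyboard_graph()

def levenshtein(a, b):
    """Classic edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def typo_aware_distance(a, b, graph=KEYBOARD, max_typos=2):
    """Return 0 when the words differ only by adjacent-key slips,
    otherwise fall back to the Levenshtein distance."""
    a, b = a.lower(), b.lower()
    if len(a) == len(b):
        mismatches = [(x, y) for x, y in zip(a, b) if x != y]
        adjacent = all(y in graph.get(x, ()) for x, y in mismatches)
        if len(mismatches) <= max_typos and adjacent:
            return 0
    return levenshtein(a, b)

print(typo_aware_distance("google", "googke"))  # "l" and "k" are neighbours
print(typo_aware_distance("google", "goggle"))  # "o" and "g" are not
```

Treating adjacent-key slips as distance 0 follows the guiding assumption that inadvertent typos should not count towards the overall distance; a weighted variant could instead charge a small cost per slip.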
Problem # 2 : Representation & Codewalk
QWERTY Keyboard representation
Codewalk from Github
https://github.com/ekta1007/Custom-Distance-function-for-typos-in-hand-generated-datasets-with-QWERY-Keyboard/
Problem # 3 : Designing your custom job feeds at LinkedIn
Part 1
Create filtering criteria to build your custom data-set
LinkedIn REST APIs, scraping (Scrapy, Beautiful Soup), raw HTML with urllib & NLTK
Part 2
Build a relevance score for the filtered results
TF-IDF (with cosine similarity), Structured Relevance Models (SRM), proximity search, classical Information Retrieval & pattern matching algorithms
Problem # 3 : Part 1
Algorithm & approach
1 Exploit the commonality in the URL pattern of the Jobs page, with looping constructs over the job index: http://www.linkedin.com/jobs?viewJob=&jobId=job_index&trk=rj_jshp
2 After this, only 4 outcomes exist per job page
2.1 The job you're looking for is no longer active
2.2 We can't find the job you're looking for
2.3 The job passes your filtering criteria (location, skill-specific keywords etc.)
2.4 The job is active, but does not pass your filtering criteria
3 For URLs in 2.3, build a simple CSV file with all elements of interest - Company Name, Title of Position, Job Posting URL, Posted on, Functions, Industry, Similar Jobs
4 Optionally pre-filter by company name, using heuristics or a quick scoring function for each company name
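The loop over job indices and the four-way classification can be sketched as below. Fetching the pages is deliberately left out; the function names, the outcome labels and the rule that every keyword must appear are assumptions for illustration, and the URL pattern is the 2013 one quoted above, which may no longer resolve.

```python
import csv
import io

URL_TEMPLATE = "http://www.linkedin.com/jobs?viewJob=&jobId={job_index}&trk=rj_jshp"

def job_urls(start, stop):
    """Generate candidate job URLs by looping over the job index."""
    return [URL_TEMPLATE.format(job_index=i) for i in range(start, stop)]

def classify_page(page_text, keywords):
    """Assign one of the four outcomes to the text of a job page."""
    text = page_text.lower()
    if "no longer active" in text:
        return "inactive"
    if "find the job" in text:
        return "not_found"
    if all(keyword.lower() in text for keyword in keywords):
        return "match"
    return "filtered_out"

def to_csv(rows, fieldnames):
    """Serialize the matching jobs as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up page text and row for illustration only.
page = "Data Scientist, Bangalore. Skills: Python, NLTK"
outcome = classify_page(page, ["python", "bangalore"])
fields = ["Company Name", "Title of Position", "Job Posting Url"]
rows = [{"Company Name": "ExampleCo",
         "Title of Position": "Data Scientist",
         "Job Posting Url": job_urls(1, 2)[0]}]
csv_text = to_csv(rows, fields)
print(outcome)
print(csv_text)
```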
Problem # 3 : Part 2
Algorithm & Approach
1 A document is a bag-of-words - tokenize, then apply standard preprocessing (punctuation, lower case)
2 Exploratory analysis with NLTK's dispersion plot, FreqDist, collocations, common contexts, count, concordance
3 Lemmatization & stemming with NLTK - derivationally related forms with Morphy (WordNet), Porter stemmer, WordNet app
4 TF-IDF - with the stop word list from NLTK & additional weightage for words of your specific interest
5 Inverted index - for each word in the vocabulary, store the doc-id with its TF-IDF, making up the whole TF-IDF vector
6 Given the TF-IDF vectors of the document corpus & query, compute the cosine similarity of each document with the base query
7 Output the top-k relevant job listings, with the top-k' words in each job feed
8 Augment this with a classical structured relevance model, proximity search or a pattern matching algorithm
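Steps 1 and 4 to 7 can be illustrated with a small pure-Python sketch. The assumptions here are mine, not the talk's: a tiny hand-written stop-word list instead of NLTK's, a smoothed IDF of log(N/df) + 1, no stemming, and made-up job descriptions.

```python
import math
import string
from collections import Counter

# Hypothetical stop-word list; the talk uses NLTK's list instead.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "to", "with"}

def tokenize(text):
    """Lower-case, strip punctuation and drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    return [t for t in text.lower().translate(table).split() if t not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [tokenize(doc) for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(len(docs) / df) + 1.0 for term, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        total = len(tokens) or 1
        vectors.append({t: (c / total) * idf[t] for t, c in Counter(tokens).items()})
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, docs, k=2):
    """Return the top-k (score, doc_index) pairs for the base query."""
    vectors, idf = tfidf_vectors(docs)
    counts = Counter(t for t in tokenize(query) if t in idf)
    total = sum(counts.values()) or 1
    query_vec = {t: (c / total) * idf[t] for t, c in counts.items()}
    scored = sorted(((cosine(query_vec, v), i) for i, v in enumerate(vectors)),
                    reverse=True)
    return scored[:k]

def top_words(vector, k=3):
    """Top-k' words of a document by TF-IDF weight."""
    return sorted(vector, key=vector.get, reverse=True)[:k]

# Made-up job descriptions for illustration only.
jobs = [
    "Data scientist with machine learning and statistics",
    "Procurement manager for supply strategy",
    "Senior data analyst, statistics and reporting",
]
ranking = rank("data scientist statistics", jobs, k=2)
print(ranking)
```

Words that occur in every document get the lowest weight, which is why rarer terms such as "scientist" dominate the ranking in this example.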
Problem # 3 : Representation & Visualization
[Figure: ranked job listings with their scores. The top listing is Business Analysis Sr. Manager at Dell in Bangalore [0.99374331], followed by four Amazon roles (Sr. Software Development Engineer / Architect - Ad Tech; Technical Program Manager - Amazon Fashion Technology; Data Analytics / Business Intelligence Engineer - Buying Experience; Sr. Software Development Engineer, Collective Human Intelligence) and by listings from WNS Global Services, Adobe, Yahoo, Biocon and NetApp]
Relevant Job feeds ranked by Cosine Similarity & TF-IDF to the base
query
Problem # 3 : Representation & Codewalk
Top ranking words in the respective job-feeds [3]
Business Analysis Sr. Manager (Dell): dell [0.02380102], procurement [0.023016447], develops [0.010462021], strategy [0.010412946], goals [0.008369617], reports [0.008369617], analyzes [0.008369617], internal [0.006277213], planning [0.006277213]
Sr. Software Development Engineer / Architect-Ad Tech (Amazon): advertising [0.015726458], highly [0.008848322], amazon [0.007088328], traffic [0.006636242], advertisement [0.006636242], increase [0.006636242], architect [0.006532478], ad [0.005993003], distributed [0.004794402]
Sr. Energy Data Scientist (NextOrbit Inc): nextorbit [0.047870868], energy [0.015740528], discovering [0.013055691], understanding [0.011015618], scientist [0.01100065], enable [0.009432181], space [0.00928175], trends [0.00928175]
Codewalk from Github
https://github.com/ekta1007/Creating-Custom-job-feeds-for-Linkedin
[3] Note that some of the words are relatively uninformative, like "understanding" and "Amazon"; any of proximity search, phrase search, positional index or next-word index could be used to address this
Problem # 4 : Designing a near-real time Performance
Aggregation Platform
Concept
Once you roll your competing candidate models into production, they
should self-decide based on performance
Competing goals can include
1 Minimizing the entropy of the complete system (a more centered distribution)
2 Optimizing precision & recall [4] - a tradeoff between type I & type II errors
3 Accounting for trends - 1st order (velocity) and 2nd order (acceleration)
4 Minimizing bias - low mean square error, moving towards a more global representation of the problem
[4] Precision - fraction of retrieved instances that are relevant; recall - fraction of relevant instances that are retrieved
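The quantities named in these goals are easy to pin down in code. The sketch below is one plausible reading, assuming Shannon entropy for goal 1 and first and second differences of a metric over time periods for velocity and acceleration; the function names and the sample numbers are hypothetical.

```python
import math

def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def entropy(probabilities):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def velocity(series):
    """First-order trend: change of a metric between consecutive periods."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

def acceleration(series):
    """Second-order trend: change of the velocity between periods."""
    return velocity(velocity(series))

# Made-up values for illustration only.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6])
metric_by_period = [1, 3, 6, 10]
print(p, r)
print(velocity(metric_by_period), acceleration(metric_by_period))
print(entropy([0.5, 0.5]))
```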
Problem # 4 : Defining the aggregated proxy for
Performance
Algorithm & approach
1 A priori: initialize the candidate population, all else equal
2 Define model equations, which generate the scores, per model
3 Define the metrics for weighted performance and what is considered a "success event"
4 Define threshold conditions and the measure of success for each metric, wherever applicable
5 Scale the aggregated performance to obtain the next period's candidate population
6 Roll out the new candidate population to the next period, per model (probabilistic model)
Scores are centrality scores (Welford's Algorithm), so as not to be over-enthusiastic (or over-pessimistic) about state-outcomes, with weights for recency of performance over long learning cycles
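Welford's algorithm keeps a running mean and variance without storing past scores, which suits a near-real-time setting. The sketch below shows the standard update, plus a hypothetical `next_population` helper that scales aggregated performance into next-period shares; that helper and the sample numbers are assumptions, not the talk's model equations.

```python
class RunningStats:
    """Welford's online algorithm for mean and sample variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

def next_population(performance):
    """Scale per-model performance so that the shares sum to 1.
    Falls back to equal shares when no model has scored yet."""
    total = sum(performance.values())
    if total == 0:
        return {model: 1.0 / len(performance) for model in performance}
    return {model: score / total for model, score in performance.items()}

# Made-up scores for illustration only.
stats = RunningStats()
for score in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(score)
shares = next_population({"M1": 3.0, "M2": 1.0})
print(stats.mean, stats.variance)
print(shares)
```

The equal-share fallback mirrors step 1, where the candidate population starts out with all else equal.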
Problem # 4 : Designing the performance benchmark
Benchmark table: one row per model (M1 to M5) for each time period (t1, t2), with the following columns
Time period
Model # bench-marked against
Proxy for bias: Mean Square Error
Type I & Type II errors: Precision, Recall
Measure of entropy: Mean Score, Mean Variance Score
Velocity Metric: Velocity by Metric 1, Velocity by Metric 2
Acceleration Metric: Acceleration by Metric 1, Acceleration by Metric 2
Problem # 4 : Representation & Codewalk
Aggregated performance in Period 1
Codewalk from Github
https://github.com/ekta1007/Performance-aggregation-platform—
learning-in-near-real-time
Problem # 4 : Representation (cont.)
Aggregated performance in Period 1 & 2 respectively
Resources
Links to the problems I talked about today (I keep updating this)
Understanding data structures in Python
A Programmer’s Guide to Data Mining - Ron Zacharski
Creating a Spell Checker with Solr
An Intuitive example to TF-IDF, Iraq War logs - Jonathan Stray
An Introduction to Information Retrieval - Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze
Natural Language Processing with Python - Steven Bird, Ewan Klein, Edward Loper
Stemming & lemmatization
Programming Collective Intelligence: Building Smart Web 2.0 Applications - Toby Segaran
Mining the Social Web: Analyzing Data from Facebook, Twitter, LinkedIn - Matthew A. Russell
@ektagrover/ekta1007@gmail.com PyCon India, 2013 25/26
@ektagrover/ekta1007@gmail.com PyCon India, 2013 26/26

Contenu connexe

Tendances

Is Text Search an Effective Approach for Fault Localization: A Practitioners ...
Is Text Search an Effective Approach for Fault Localization: A Practitioners ...Is Text Search an Effective Approach for Fault Localization: A Practitioners ...
Is Text Search an Effective Approach for Fault Localization: A Practitioners ...Debdoot Mukherjee
 
Building Search & Recommendation Engines
Building Search & Recommendation EnginesBuilding Search & Recommendation Engines
Building Search & Recommendation EnginesTrey Grainger
 
Lemur Tutorial at SIGIR 2006
Lemur Tutorial at SIGIR 2006Lemur Tutorial at SIGIR 2006
Lemur Tutorial at SIGIR 2006pogil
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
The Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation EnginesThe Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation EnginesTrey Grainger
 
Overview of entity framework by software outsourcing company india
Overview of entity framework by software outsourcing company indiaOverview of entity framework by software outsourcing company india
Overview of entity framework by software outsourcing company indiaJignesh Aakoliya
 
Tecnical Apps Interview Questions
Tecnical Apps Interview QuestionsTecnical Apps Interview Questions
Tecnical Apps Interview QuestionsHossam El-Faxe
 

Tendances (11)

BDACA1617s2 - Lecture3
BDACA1617s2 - Lecture3BDACA1617s2 - Lecture3
BDACA1617s2 - Lecture3
 
Is Text Search an Effective Approach for Fault Localization: A Practitioners ...
Is Text Search an Effective Approach for Fault Localization: A Practitioners ...Is Text Search an Effective Approach for Fault Localization: A Practitioners ...
Is Text Search an Effective Approach for Fault Localization: A Practitioners ...
 
Building Search & Recommendation Engines
Building Search & Recommendation EnginesBuilding Search & Recommendation Engines
Building Search & Recommendation Engines
 
Lemur Tutorial at SIGIR 2006
Lemur Tutorial at SIGIR 2006Lemur Tutorial at SIGIR 2006
Lemur Tutorial at SIGIR 2006
 
Software Architecture - Quiz Questions
Software Architecture - Quiz QuestionsSoftware Architecture - Quiz Questions
Software Architecture - Quiz Questions
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Vespa, A Tour
Vespa, A TourVespa, A Tour
Vespa, A Tour
 
The Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation EnginesThe Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation Engines
 
Haystacks slides
Haystacks slidesHaystacks slides
Haystacks slides
 
Overview of entity framework by software outsourcing company india
Overview of entity framework by software outsourcing company indiaOverview of entity framework by software outsourcing company india
Overview of entity framework by software outsourcing company india
 
Tecnical Apps Interview Questions
Tecnical Apps Interview QuestionsTecnical Apps Interview Questions
Tecnical Apps Interview Questions
 

En vedette

Integrating Lucene into a Transactional XML Database
Integrating Lucene into a Transactional XML DatabaseIntegrating Lucene into a Transactional XML Database
Integrating Lucene into a Transactional XML Databaselucenerevolution
 
Java Search Engine Framework
Java Search Engine FrameworkJava Search Engine Framework
Java Search Engine FrameworkAppsterdam Milan
 
Tutorial 5 (lucene)
Tutorial 5 (lucene)Tutorial 5 (lucene)
Tutorial 5 (lucene)Kira
 
Distance measure between two biological sequences
Distance  measure between  two biological  sequences Distance  measure between  two biological  sequences
Distance measure between two biological sequences ShwetA Kumari
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with LuceneWO Community
 
FOSDEM (feb 2011) - A real-time search engine with Lucene and S4
FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4
FOSDEM (feb 2011) - A real-time search engine with Lucene and S4Michaël Figuière
 
Personal Matching Recommendation system in TinderBox
Personal Matching Recommendation system in TinderBoxPersonal Matching Recommendation system in TinderBox
Personal Matching Recommendation system in TinderBoxMad Scientists
 
Architecture of a search engine
Architecture of a search engineArchitecture of a search engine
Architecture of a search engineSylvain Utard
 
Fast detection of transformed data leaks[mithun_p_c]
Fast detection of transformed data leaks[mithun_p_c]Fast detection of transformed data leaks[mithun_p_c]
Fast detection of transformed data leaks[mithun_p_c]MithunPChandra
 
Basics of Search Engines and Algorithms
Basics of Search Engines and AlgorithmsBasics of Search Engines and Algorithms
Basics of Search Engines and AlgorithmsWeb Trainings Academy
 
Routing algorithm
Routing algorithmRouting algorithm
Routing algorithmfarimoin
 

En vedette (13)

Integrating Lucene into a Transactional XML Database
Integrating Lucene into a Transactional XML DatabaseIntegrating Lucene into a Transactional XML Database
Integrating Lucene into a Transactional XML Database
 
Approximate string comparators
Approximate string comparatorsApproximate string comparators
Approximate string comparators
 
10-1 Vocab of Terms
10-1 Vocab of Terms10-1 Vocab of Terms
10-1 Vocab of Terms
 
Java Search Engine Framework
Java Search Engine FrameworkJava Search Engine Framework
Java Search Engine Framework
 
Tutorial 5 (lucene)
Tutorial 5 (lucene)Tutorial 5 (lucene)
Tutorial 5 (lucene)
 
Distance measure between two biological sequences
Distance  measure between  two biological  sequences Distance  measure between  two biological  sequences
Distance measure between two biological sequences
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with Lucene
 
FOSDEM (feb 2011) - A real-time search engine with Lucene and S4
FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4FOSDEM (feb 2011) -  A real-time search engine with Lucene and S4
FOSDEM (feb 2011) - A real-time search engine with Lucene and S4
 
Personal Matching Recommendation system in TinderBox
Personal Matching Recommendation system in TinderBoxPersonal Matching Recommendation system in TinderBox
Personal Matching Recommendation system in TinderBox
 
Architecture of a search engine
Architecture of a search engineArchitecture of a search engine
Architecture of a search engine
 
Fast detection of transformed data leaks[mithun_p_c]
Fast detection of transformed data leaks[mithun_p_c]Fast detection of transformed data leaks[mithun_p_c]
Fast detection of transformed data leaks[mithun_p_c]
 
Basics of Search Engines and Algorithms
Basics of Search Engines and AlgorithmsBasics of Search Engines and Algorithms
Basics of Search Engines and Algorithms
 
Routing algorithm
Routing algorithmRouting algorithm
Routing algorithm
 

Similaire à PyCon 2013 - Experiments in data mining, entity disambiguation and how to think data-structures for designing beautiful algorithms

Dice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkDice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkSimon Hughes
 
IRJET- Semantic Question Matching
IRJET- Semantic Question MatchingIRJET- Semantic Question Matching
IRJET- Semantic Question MatchingIRJET Journal
 
White paper - Job skills extraction with LSTM and Word embeddings - Nikita Sh...
White paper - Job skills extraction with LSTM and Word embeddings - Nikita Sh...White paper - Job skills extraction with LSTM and Word embeddings - Nikita Sh...
White paper - Job skills extraction with LSTM and Word embeddings - Nikita Sh...Nikita Sharma
 
IRJET - Automated Essay Grading System using Deep Learning
IRJET -  	  Automated Essay Grading System using Deep LearningIRJET -  	  Automated Essay Grading System using Deep Learning
IRJET - Automated Essay Grading System using Deep LearningIRJET Journal
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Lucidworks
 
Twitter Sentiment Analysis: An Unsupervised Approach
Twitter Sentiment Analysis: An Unsupervised ApproachTwitter Sentiment Analysis: An Unsupervised Approach
Twitter Sentiment Analysis: An Unsupervised ApproachIRJET Journal
 
E-Commerce Product Rating Based on Customer Review
E-Commerce Product Rating Based on Customer ReviewE-Commerce Product Rating Based on Customer Review
E-Commerce Product Rating Based on Customer ReviewIRJET Journal
 
professional fuzzy type-ahead rummage around in xml type-ahead search techni...
professional fuzzy type-ahead rummage around in xml  type-ahead search techni...professional fuzzy type-ahead rummage around in xml  type-ahead search techni...
professional fuzzy type-ahead rummage around in xml type-ahead search techni...Kumar Goud
 
ClassifyingIssuesFromSRTextAzureML
ClassifyingIssuesFromSRTextAzureMLClassifyingIssuesFromSRTextAzureML
ClassifyingIssuesFromSRTextAzureMLGeorge Simov
 
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdfbeshahashenafe20
 
Souvenir's Booth - Algorithm Design and Analysis Project Project Report
Souvenir's Booth - Algorithm Design and Analysis Project Project ReportSouvenir's Booth - Algorithm Design and Analysis Project Project Report
Souvenir's Booth - Algorithm Design and Analysis Project Project ReportAkshit Arora
 
The search engine index
The search engine indexThe search engine index
The search engine indexCJ Jenkins
 
Designing a Generative AI QnA solution with Proprietary Enterprise Business K...
Designing a Generative AI QnA solution with Proprietary Enterprise Business K...Designing a Generative AI QnA solution with Proprietary Enterprise Business K...
Designing a Generative AI QnA solution with Proprietary Enterprise Business K...IRJET Journal
 
journal papers 25.pdf
journal papers 25.pdfjournal papers 25.pdf
journal papers 25.pdfnareshkotra
 
scopus indexing journal.pdf
scopus indexing  journal.pdfscopus indexing  journal.pdf
scopus indexing journal.pdfnareshkotra
 
scopus index journal.pdf
scopus index journal.pdfscopus index journal.pdf
scopus index journal.pdfnareshkotra
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesMax Irwin
 
QER : query entity recognition
QER : query entity recognitionQER : query entity recognition
QER : query entity recognitionDhwaj Raj
 
IRJET- Natural Language Query Processing
IRJET- Natural Language Query ProcessingIRJET- Natural Language Query Processing
IRJET- Natural Language Query ProcessingIRJET Journal
 

Similaire à PyCon 2013 - Experiments in data mining, entity disambiguation and how to think data-structures for designing beautiful algorithms (20)

Dice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank TalkDice.com Bay Area Search - Beyond Learning to Rank Talk
Dice.com Bay Area Search - Beyond Learning to Rank Talk
 
IRJET- Semantic Question Matching
IRJET- Semantic Question MatchingIRJET- Semantic Question Matching
IRJET- Semantic Question Matching
 
White paper - Job skills extraction with LSTM and Word embeddings - Nikita Sh...
White paper - Job skills extraction with LSTM and Word embeddings - Nikita Sh...White paper - Job skills extraction with LSTM and Word embeddings - Nikita Sh...
White paper - Job skills extraction with LSTM and Word embeddings - Nikita Sh...
 
Text Analytics for Legal work
Text Analytics for Legal workText Analytics for Legal work
Text Analytics for Legal work
 
IRJET - Automated Essay Grading System using Deep Learning
IRJET -  	  Automated Essay Grading System using Deep LearningIRJET -  	  Automated Essay Grading System using Deep Learning
IRJET - Automated Essay Grading System using Deep Learning
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
 
Twitter Sentiment Analysis: An Unsupervised Approach
Twitter Sentiment Analysis: An Unsupervised ApproachTwitter Sentiment Analysis: An Unsupervised Approach
Twitter Sentiment Analysis: An Unsupervised Approach
 
E-Commerce Product Rating Based on Customer Review
E-Commerce Product Rating Based on Customer ReviewE-Commerce Product Rating Based on Customer Review
E-Commerce Product Rating Based on Customer Review
 
professional fuzzy type-ahead rummage around in xml type-ahead search techni...
professional fuzzy type-ahead rummage around in xml  type-ahead search techni...professional fuzzy type-ahead rummage around in xml  type-ahead search techni...
professional fuzzy type-ahead rummage around in xml type-ahead search techni...
 
ClassifyingIssuesFromSRTextAzureML
ClassifyingIssuesFromSRTextAzureMLClassifyingIssuesFromSRTextAzureML
ClassifyingIssuesFromSRTextAzureML
 
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf
02 Text Operatiohhfdhjghdfshjgkhjdfjhglkdfjhgiuyihjufidhcun.pdf
 
Souvenir's Booth - Algorithm Design and Analysis Project Project Report
Souvenir's Booth - Algorithm Design and Analysis Project Project ReportSouvenir's Booth - Algorithm Design and Analysis Project Project Report
Souvenir's Booth - Algorithm Design and Analysis Project Project Report
 
The search engine index
The search engine indexThe search engine index
The search engine index
 
Designing a Generative AI QnA solution with Proprietary Enterprise Business K...
Designing a Generative AI QnA solution with Proprietary Enterprise Business K...Designing a Generative AI QnA solution with Proprietary Enterprise Business K...
Designing a Generative AI QnA solution with Proprietary Enterprise Business K...
 
journal papers 25.pdf
journal papers 25.pdfjournal papers 25.pdf
journal papers 25.pdf
 
scopus indexing journal.pdf
scopus indexing  journal.pdfscopus indexing  journal.pdf
scopus indexing journal.pdf
 
scopus index journal.pdf
scopus index journal.pdfscopus index journal.pdf
scopus index journal.pdf
 
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and VocabulariesHaystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
 
QER : query entity recognition
QER : query entity recognitionQER : query entity recognition
QER : query entity recognition
 
IRJET- Natural Language Query Processing
IRJET- Natural Language Query ProcessingIRJET- Natural Language Query Processing
IRJET- Natural Language Query Processing
 

PyCon 2013 - Experiments in data mining, entity disambiguation and how to think data-structures for designing beautiful algorithms

  • 1. Experiments in data mining, entity disambiguation and how to think data-structures for designing beautiful algorithms Ekta Grover 1st September, 2013 @ektagrover/ekta1007@gmail.com PyCon India, 2013 1/26
  • 2. Structure of this Talk
      How to think through a data mining problem
      3 problems in data mining around disambiguation, natural language processing & information retrieval
      Measuring your model performance - developing a near-real time performance aggregation platform
      Putting it all together
  • 3. The only algorithm you will ever need to know
      1. Write down the problem
      2. Think real hard
      3. Write down the solution
  • 4. Thinking through a data-mining problem
      What is the data and how should you represent it?
      What are the relationships you are interested in?
      How do you define your objective function?
      How will you know that the score represents the direction you want to capture?
      How will you measure the performance after you are done?
  • 5. Problem # 1 : Entity Disambiguation
      Problem: So, how much is your network really worth?
      What is disambiguation?
      Information = Related + Unrelated
      Extract significant attributes/features of an entity
      Relationships between entities & features
      The objective function is just a way to measure: rank/score the features for an entity with respect to other entities
      Disambiguation is not a deterministic problem
  • 6. Problem # 1 : Entity Disambiguation
      Guiding question #1: What form of disambiguation to use for the entities?
      Algorithm & approach
      1. Preprocessing - lower casing & removing punctuation
      2. Exploratory analysis - frequency plots (with NLTK FreqDist & PrettyTable)
      3. Tokenization to find "keywords" to disambiguate
      4. Handling stop words & internationalization
      5. Frequent item-set mining with support = # of buckets
      6. Post processing, frequency plots & visualization
      Extensions: Handling typographical errors and augmenting this data-set with scraping
  • 7. Problem # 1 : Talking Specifics & answering the "Why"
      Frequent item-set mining & concept of support
      How do we know the number of significant words in an entity name?
      1. For every company name, strip the 1st word and store it as the key of a "local" dictionary, with the values as a list
      2. On the local dict created in step 1, do frequent item-set mining with support = # baskets: if a word is significant (and uniquely describes an entity), it should be present in all baskets
      3. Treat all the values in a local dict as a set: if all elements of a local dict are the same, there is no need for item-set mining
      4. For every key, find the smallest list of tokens (by length) and search for the presence of each token iteratively until the support falls below # baskets
      5. Replace every value in that dictionary with the words that passed the support test
      6. Fetch all values from all keys in the dict of lists and flatten this list
      7. Run regular frequency counts on this flattened list of now disambiguated elements
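The steps on slide 7 can be sketched in a few lines of Python. This is a minimal illustration, not the code from the talk's repository: the function names `normalise` and `disambiguate` are hypothetical, and it assumes that "support = # baskets" means a token is kept only if it occurs in every name that shares the same first word.

```python
from collections import Counter, defaultdict


def normalise(name):
    """Lower-case, replace punctuation with spaces, and split into tokens."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return cleaned.split()


def disambiguate(names):
    """Count entities after collapsing name variants that share a first word.

    Assumption: a trailing token is significant only if it appears in all
    baskets (token lists) stored under the same first-word key.
    """
    baskets = defaultdict(list)  # first word -> list of remaining-token lists
    for name in names:
        tokens = normalise(name)
        if tokens:
            baskets[tokens[0]].append(tokens[1:])

    canonical = []
    for key, token_lists in baskets.items():
        shortest = min(token_lists, key=len)
        kept = []
        for token in shortest:
            support = sum(1 for tokens in token_lists if token in tokens)
            if support < len(token_lists):  # support dropped below # baskets
                break
            kept.append(token)
        # Every variant under this key is replaced by the same canonical name.
        canonical.extend([" ".join([key] + kept)] * len(token_lists))
    return Counter(canonical)


counts = disambiguate(
    ["Ebay Inc", "ebay inc.", "Ebay Inc India", "Google", "Google India", "Microsoft Corporation"]
)
```

Here the three Ebay variants collapse to "ebay inc" and the two Google variants to "google", so the final frequency count is taken over disambiguated names rather than raw strings.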
  • 8. Problem # 1 : Limitations of this approach
      1. Abbreviations: Boston Consulting Group & BCG, State Bank of India & SBI - build a domain-specific Word Sense Disambiguation (WSD) reference database
      2. Differently referenced entities such as "Innovation labs 247 customer private ltd" and "247 innovation labs": the referencing key will be different for these
      3. Sub-entities of a company with punctuation, e.g. "Ebay/Paypal": the "key" depends on how we choose to treat the punctuation for {/,:,-}
      4. Need a module to handle inadvertent typographical errors in entity names; names are proper nouns, so there is no referencing from the WordNet database - plug in Problem #2
      Extensions: Cross-document and co-reference resolution, multi-pass learning, smoothing & annealed learning algorithms
  • 9. Problem # 1 : Representation
  • 10. Problem # 1 : Representation, a more extreme example & Codewalk
      Codewalk from Github: https://github.com/ekta1007/Experiments-in-Data-mining
  • 11. Problem # 1 : Visualization
      [Figure: Top companies by frequency distribution, post item-set mining. Node labels: SAP, [24]7 Inc, VMware, Ibm, Google, Thoughtworks, Yahoo, Ebay inc, Intuit, Microsoft Corporation, Accenture, Linkedin, University of Cologne, Dell, Deutsche Bank, Flipkart.com, Boston Consulting Group, Zynga, Amazon, Anita Borg Institute for Women & Technology]
      * I used R (ggplot2 and igraph) to plot the nodes with a ColorBrewer palette and Inkscape for "cosmetics"
  • 12. Problem # 2 : Custom distance metric for handling typographical errors, optimized for a QWERTY keyboard
      Use case: Disambiguating typographical errors with a contrasting (noun) item
      Three types of misspellings [1]
      1. Typographic errors - the writer knows the correct spelling, but motor coordination slips when typing
      2. Cognitive errors - caused by the writer's lack of knowledge
      3. Phonetic errors [2] - sound correct, but are orthographically incorrect
      Distance measures: Levenshtein, Manhattan, Euclidean, Cosine - inappropriate for distance in our context. Need a hybrid contextual approach
      [1] Karen Kukich - Techniques for automatically correcting words in text, ACM Computing Surveys, Volume 24, Issue 4, Dec. 1992; Creating a Spellchecker with Solr http://emmaespina.wordpress.com/2011/01/31/20/
      [2] Soundex algorithm; example: Chebyshev & Tchebycheff
  • 13. Problem # 2 : QWERTY distance metric
      Guiding assumption: Since typographical errors are inadvertently introduced, they should not count towards the overall distance
      Algorithm & approach
      1. Data structures - QWERTY keyboard as a graph with connected adjacent nodes, neighborhood nodes with appropriate distances
      2. Define a typographical error & its thresholds - what can be the difference in length of words, how many possible "swaps" can occur in a typo
      3. Implement the custom distance metric - fall back to Levenshtein when non-typographical - hybrid approach
      4. Output whether the words are identical, or the distance taking typographical errors into account
      5. Contrast and blend in with other fuzzy logic methods, Apache Solr, spell checker & auto-suggest
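A rough sketch of the hybrid metric from slide 13, under several simplifying assumptions that are not from the talk: the keyboard is treated as an unstaggered grid of the three letter rows, a typo is defined as at most `max_slips` substitutions by a neighbouring key in words of equal length, and anything else falls back to plain Levenshtein distance. The names `build_adjacency`, `typo_distance` and `max_slips` are illustrative only.

```python
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_adjacency(rows=ROWS):
    """Keyboard as a graph: each key maps to the set of keys next to it.

    Assumption: rows are aligned on a plain grid (real keyboards are staggered).
    """
    pos = {ch: (r, c) for r, row in enumerate(rows) for c, ch in enumerate(row)}
    adj = {ch: set() for ch in pos}
    for a, (ra, ca) in pos.items():
        for b, (rb, cb) in pos.items():
            if a != b and abs(ra - rb) <= 1 and abs(ca - cb) <= 1:
                adj[a].add(b)
    return adj


ADJ = build_adjacency()


def levenshtein(a, b):
    """Classic edit distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def typo_distance(a, b, max_slips=2):
    """Return 0 when the words differ only by adjacent-key slips, else Levenshtein."""
    a, b = a.lower(), b.lower()
    if len(a) == len(b):
        mismatches = [(x, y) for x, y in zip(a, b) if x != y]
        if len(mismatches) <= max_slips and all(y in ADJ.get(x, ()) for x, y in mismatches):
            return 0
    return levenshtein(a, b)
```

With this definition, "googke" is treated as identical to "google" because "k" sits next to "l", while "goggle" keeps its Levenshtein distance of 1 because "g" is not a neighbour of "o".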
  • 14. Problem # 2 : Representation & Codewalk
      QWERTY keyboard representation
      Codewalk from Github: https://github.com/ekta1007/Custom-Distance-function-for-typos-in-hand-generated-datasets-with-QWERY-Keyboard/
  • 15. Problem # 3 : Designing your custom job feeds at Linkedin
      Part 1: Create filtering criteria to build your custom data-set
      Linkedin REST APIs, scraping (Scrapy, Beautiful Soup), raw HTML with urllib & NLTK
      Part 2: Build relevance score of the filtered results
      TF-IDF (with cosine similarity), Structured Relevance Models (SRM), proximity search, classical Information Retrieval & pattern matching algorithms
  • 16. Problem # 3 : Part 1
      Algorithm & approach
      1. Exploit the commonality in the URL pattern of the Jobs page, with looping constructs for the job index: http://www.linkedin.com/jobs?viewJob=&jobId=job_index&trk=rj_jshp
      2. After this, only 4 outcomes exist per Job page
         1. The job you’re looking for is no longer active
         2. We can’t find the job you’re looking for
         3. The job passes your filtering criteria (location, skill-specific keywords etc)
         4. The job is active, but does not pass your filtering criteria
      3. For URLs in 2.3, build a simple CSV file with all elements of interest - Company Name, Title of Position, Job Posting Url, Posted on, Functions, Industry, similar Jobs
      4. Can pre-filter by company name - by heuristics, or a quick scoring function for each company name
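The four outcomes on slide 16 could be encoded roughly as below. This is a sketch only: it works on page text that has already been downloaded, the URL template is the 2013-era pattern quoted on the slide, and the helper names (`job_urls`, `classify`) and outcome labels are hypothetical. Treating "passes the filter" as "contains every keyword" is also an assumption.

```python
URL_TEMPLATE = "http://www.linkedin.com/jobs?viewJob=&jobId={job_index}&trk=rj_jshp"

# Messages quoted on the slide, written with plain apostrophes.
INACTIVE = "The job you're looking for is no longer active"
MISSING = "We can't find the job you're looking for"


def job_urls(start, stop):
    """Build the job-page URLs for job indices in [start, stop)."""
    return [URL_TEMPLATE.format(job_index=i) for i in range(start, stop)]


def classify(page_text, keywords):
    """Map a fetched job page to one of the four outcomes.

    Assumption: a job passes the filter when every keyword occurs in the page.
    """
    text = page_text.replace("\u2019", "'")  # normalise typographic apostrophes
    if INACTIVE in text:
        return "inactive"
    if MISSING in text:
        return "missing"
    lowered = text.lower()
    if all(keyword.lower() in lowered for keyword in keywords):
        return "match"
    return "no_match"
```

Only pages classified as "match" would then be written to the CSV file described in step 3.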
  • 17. Problem # 3 : Part 2
      Algorithm & approach
      1. A document is a bag-of-words - tokenize, standard preprocessing (punctuation, lower case)
      2. Exploratory analysis with NLTK's dispersion plot, FreqDist, collocations, common contexts, count, concordance
      3. Lemmatization & stemming with NLTK's derivationally related forms with Morphy (WordNet), Porter, WordNet app
      4. TF-IDF - with the stop word list from NLTK & additional weightage for words of your specific interest
      5. Inverted index - for each word in the vocabulary, store the doc-id with its TF-IDF, making up the whole TF-IDF vector
      6. Given the TF-IDF vectors of the document corpus & query, compute the cosine similarity of each document with the base query
      7. Output the top-k relevant job listings, with the top-k' words in each job feed
      8. Augment this with a classical structured relevance model, proximity search or pattern matching algorithm
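Steps 1, 4, 6 and 7 above can be illustrated with the standard library alone. This sketch is not the talk's implementation: it skips NLTK, stemming and the inverted index, uses a smoothed IDF of log((1 + N) / (1 + df)) + 1 as one common variant, and all function names are illustrative.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case the text and split on anything that is not alphanumeric."""
    return "".join(ch if ch.isalnum() else " " for ch in text.lower()).split()


def build_idf(token_lists):
    """Smoothed inverse document frequency per vocabulary word (an assumed variant)."""
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def vectorize(tokens, idf):
    """Sparse TF-IDF vector; words outside the corpus vocabulary are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    total = sum(tf.values())
    return {t: (c / total) * idf[t] for t, c in tf.items()} if total else {}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(docs, query, k=3, stop_words=frozenset()):
    """Return (doc index, score) pairs for the k documents closest to the query."""
    token_lists = [[t for t in tokenize(d) if t not in stop_words] for d in docs]
    idf = build_idf(token_lists)
    doc_vectors = [vectorize(tokens, idf) for tokens in token_lists]
    query_vector = vectorize(tokenize(query), idf)
    scored = sorted(((cosine(query_vector, v), i) for i, v in enumerate(doc_vectors)), reverse=True)
    return [(i, score) for score, i in scored[:k]]


jobs = ["data scientist machine learning", "sales manager retail", "senior data analyst"]
ranking = rank(jobs, "data scientist")
```

The first job shares both query words and ranks highest, the third shares only "data", and the second has no overlap and scores zero.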
  • 18. Problem # 3 : Representation & Visualization
      [Figure: Relevant job feeds ranked by cosine similarity & TF-IDF to the base query. The top-ranked listing is "Business Analysis Sr. Manager at Dell in Bangalore" [0.99374331], followed by listings from Amazon, WNS Global Services, Adobe, Yahoo, Biocon and NetApp]
  • 19. Problem # 3 : Representation & Codewalk
      [Figure: Top ranking words in the respective job feeds [3], listed in descending TF-IDF order]
      Business Analysis Sr. Manager (Dell): dell, procurement, develops, strategy, goals, reports, analyzes, internal, planning
      Sr. Software Development Engineer / Architect-Ad Tech (Amazon): advertising, highly, amazon, traffic, advertisement, increase, architect, ad, distributed
      Sr. Energy Data Scientist (NextOrbit Inc): nextorbit, energy, discovering, understanding, scientist, enable, space, trends
      Codewalk from Github: https://github.com/ekta1007/Creating-Custom-job-feeds-for-Linkedin
      [3] Note that some of the words are relatively uninformative, like "understanding" or "Amazon" - could use any amongst proximity search, phrase search, positional index, next word index
  • 20. Problem # 4 : Designing a near-real time Performance Aggregation Platform
      Concept: Once you roll your competing candidate models into production, they should self-decide based on performance
      Competing goals can be amongst
      1. Minimizing entropy of the complete system (more centered distribution)
      2. Optimizing precision & recall [4] - tradeoff between type I & type II errors
      3. Accounting for trends - 1st order (velocity) and 2nd order (acceleration)
      4. Minimizing bias - low mean square error, moving towards a more global representation of a problem
      [4] Precision - fraction of retrieved instances that are relevant; recall - fraction of relevant instances that are retrieved
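The precision and recall definitions in the footnote translate directly into code. The function name and the choice to return 0.0 for empty inputs are assumptions made for this small example.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant (0.0 when undefined)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four items retrieved, three relevant overall, two in common.
p, r = precision_recall([1, 2, 3, 4], [2, 4, 6])
```

Retrieving more items can only keep or raise recall but tends to lower precision, which is the tradeoff the slide refers to.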
  • 21. Problem # 4 : Defining the aggregated proxy for Performance
      Algorithm & approach
      1. A priori: initialize the candidate population, all else equal
      2. Define model equations, which generate the scores, per model
      3. Define the metrics for weighted performance and what is considered a "success event"
      4. Define threshold conditions and the measure of success for each metric, wherever applicable
      5. Scale the aggregated performance to obtain the next period's candidate population
      6. Roll out the new candidate population to the next period, per model (probabilistic model)
      Scores are centrality scores (Welford's Algorithm), so as not to be over-enthusiastic (or over-pessimistic) about state-outcomes, with weights for recency of performance over long learning cycles
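Welford's algorithm, named on slide 21, keeps a running mean and variance without storing past scores, which suits near-real-time aggregation. The sketch below pairs it with one possible reading of step 5, where the next period's candidate population is shared among models in proportion to their aggregated score. That proportional rule, the recency weighting being left out, and the names `RunningStats` and `next_population` are assumptions, not the talk's implementation.

```python
class RunningStats:
    """Welford's online algorithm for the mean and sample variance of a stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def next_population(scores, total):
    """Split `total` candidates across models in proportion to their scores.

    Assumption: scores are non-negative; an all-zero period falls back to an equal split.
    """
    score_sum = sum(scores.values())
    if score_sum <= 0:
        return {model: total / len(scores) for model in scores}
    return {model: total * score / score_sum for model, score in scores.items()}


stats = RunningStats()
for score in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(score)

allocation = next_population({"M1": 3.0, "M2": 1.0}, total=100)
```

Because each update touches only three numbers, the per-model statistics can be refreshed as every new outcome arrives.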
  • 22. Problem # 4 : Designing the performance benchmark
      Metric groups: Proxy for bias; Type I & Type II errors; Measure of entropy; Velocity Metric; Acceleration Metric
      Column headers: Time period; Model # bench-marked against; Mean Square Error; Precision; Recall; Mean Score; Mean Score; Variance; Velocity by Metric 1; Velocity by Metric 2; Acceleration by Metric 1; Acceleration by Metric 2
      Rows: models M1 to M5 for each of the time periods t1 and t2
  • 23. Problem # 4 : Representation & Codewalk
      Aggregated performance in Period 1
      Codewalk from Github: https://github.com/ekta1007/Performance-aggregation-platform---learning-in-near-real-time
  • 24. Problem # 4 : Representation cont.
      Aggregated performance in Period 1 & 2 respectively
  • 25. Resources
      Link to the problems I talked about today (I keep updating this)
      Understanding data structures in Python
      A Programmer's Guide to Data Mining - Ron Zacharski
      Creating a Spell Checker with Solr
      An Intuitive example to TF-IDF, Iraq War logs - Jonathan Stray
      An Introduction to Information Retrieval - Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze
      Natural Language Processing with Python - Steven Bird, Ewan Klein, Edward Loper
      Stemming & lemmatization
      Programming Collective Intelligence: Building Smart Web 2.0 Applications - Toby Segaran
      Mining the Social Web: Analyzing Data from Facebook, Twitter, LinkedIn - Matthew A. Russell