Mapping Domain Names to Categories
Maya Rotmensch, Sorcha Gilroy, Corina Gurău
Academic Mentor: Cristina Garcia-Cardona
Industry Sponsor: Oversee.net (Kryztof Urban)
Institute of Pure and Applied Mathematics
Research in Industrial Projects
August 15, 2013
Institute for Pure & Applied Mathematics
University of California, Los Angeles
(Institute of Pure and Applied Mathematics) Mapping Domain Names to Categories August 15, 2013 1 / 41
Outline
1 Oversee.net
2 Problem Statement
Why so complicated?
ESA - Explicit Semantic Analysis
How Oversee.net Does It
3 Our Project
Our Focus
Methodology
Results
4 Concluding Remarks
Oversee.net’s Business Model
Person → Website
Person looking for games → A gaming website
Person looking for games → Domain → A gaming website
Direct Navigation: when users navigate to a website by using the
address bar instead of a search engine.
Example: a person looking for a gaming website navigates to 'addictinggamas.com'
Oversee.net’s Business Model
Domain parking + traffic matching −→ Oversee.net
Person → Domain → Category → Website
Monetized Domain Parking
The registration of internet domain names without placing any
content on the domain.
Owners monetize traffic by displaying links and advertisements
Oversee.net’s Business Model
Advertisers
Partners of Oversee.net
Choose the types of traffic they want from Oversee.net’s category tree
Oversee.net’s Business Model
Parked domains do not have any content
Mapping Domains to Categories is extremely difficult
Oversee.net uses Keywords to describe Domains and Categories
Domain → Keywords ↔ Keywords ← Category
Not enough, as we are not guaranteed the use of the same language!
So what’s the big deal?
Reasoning about concepts
Scarcity of input information
Example 1 - Spelling error
cheapvacatins.com
Example 2 - Ambiguous meaning
bigbearhuts.com (animals? huts? it’s supposed to be winter sports)
Text Categorization
Our problem can be thought of as a problem of categorization. We
need to assign a domain to one or more classes or categories
A natural choice is topic modeling
However, unlike most text categorization problems, we don’t actually
have documents to classify, as we are dealing with undeveloped
domains
Topic Modeling
This method analyzes the relationships between documents in a corpus by
isolating a set of topics from the documents
For meaningful results, one must work with a set of large texts,
but our data set consists of keywords, as our domains are undeveloped
This method generates topics organically,
but the categories we are attempting to map into are pre-defined
ESA - Explicit Semantic Analysis
Building a Semantic Interpreter
Using a Vector Space Model + an exogenous knowledge base [1]
−→ represent the meaning of text
# of articles ∼ 3.5 million
# of terms ∼ 45 million
[1] Evgeniy Gabrilovich and Shaul Markovitch. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), 2007.
ESA - Explicit Semantic Analysis
 | Government | Finance | Toys | Children | Bank | School | ...
Law | 0.2 | 0.3 | 0.8 | 0.9 | 0.2 | 0.7 | ...
Article2 | 0.8 | 0.9 | 0.1 | 0.3 | 0.7 | 0.5 | ...
Article3 | 0.5 | 0.2 | 0.3 | 0.6 | 0.4 | 0.8 | ...
Article4 | 0.1 | 0.2 | 0.1 | 0.3 | 0.4 | 0.2 | ...
... | ... | ... | ... | ... | ... | ... | ...
Term frequency inverse document frequency:
$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$
Logarithmically scaled term frequency:
$\mathrm{tf}(t, d) = \log(f(t, d) + 1)$
Inverse document frequency:
$\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$
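The three formulas above can be sketched in a few lines of Python. This is only an illustration: the toy corpus, the use of the natural logarithm, and the choice to score a term absent from the corpus as 0 are assumptions, not details given in the slides.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Log-scaled term frequency: log(f(t, d) + 1)."""
    return math.log(Counter(doc_tokens)[term] + 1)

def idf(term, corpus):
    """Inverse document frequency: log(|D| / |{d in D : t in d}|)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: a term absent from the corpus scores 0
    return math.log(len(corpus) / containing)

def tfidf(term, doc_tokens, corpus):
    """tfidf(t, d, D) = tf(t, d) * idf(t, D)."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical toy corpus: each "article" is a list of tokens.
corpus = [
    ["bank", "finance", "bank", "loan"],
    ["toys", "children", "school"],
    ["school", "children", "bank"],
]

print(tfidf("bank", corpus[0], corpus))
```

A term that appears in every article gets idf = log(1) = 0, so very common words contribute nothing to the score.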
ESA - Explicit Semantic Analysis
Using a Semantic Interpreter
Cosine similarity measure
$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$
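As a small illustration of the cosine measure, the sketch below compares two weight vectors. The two example vectors reuse the first two rows of the illustrative matrix shown earlier; returning 0 for a zero vector is an assumption made here to avoid division by zero.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| ||B||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: similarity with a zero vector is 0
    return dot / (norm_a * norm_b)

# Rows "Law" and "Article2" from the illustrative weight matrix.
law = [0.2, 0.3, 0.8, 0.9, 0.2, 0.7]
article2 = [0.8, 0.9, 0.1, 0.3, 0.7, 0.5]

print(cosine_similarity(law, article2))
```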
How Oversee.net Does It
Instead of comparing two texts - compare two small sets of words!
Use keywords to describe domains and categories
Represent these keywords in terms of DBpedia articles
A keyword is significantly related to an article if the TF-IDF is above a
certain threshold
The set of articles associated to a domain/category is the union of the
sets of articles associated to its keywords
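A minimal sketch of the thresholding and union steps described above. The data layout (a dictionary from keyword to per-article TF-IDF scores), the scores, the article names, and the threshold value are all hypothetical; the slides do not state the threshold Oversee.net uses.

```python
def articles_for_keyword(keyword, tfidf_scores, threshold):
    """Articles whose TF-IDF score for `keyword` is above `threshold`."""
    scores = tfidf_scores.get(keyword, {})
    return {article for article, score in scores.items() if score > threshold}

def articles_for_keywords(keywords, tfidf_scores, threshold):
    """Union of the article sets of all keywords of a domain or category."""
    result = set()
    for keyword in keywords:
        result |= articles_for_keyword(keyword, tfidf_scores, threshold)
    return result

# Hypothetical scores: keyword -> {article title: TF-IDF score}.
tfidf_scores = {
    "bank": {"Bank": 0.9, "River": 0.2},
    "financial": {"Finance": 0.8, "Bank": 0.6},
}

print(articles_for_keywords(["bank", "financial"], tfidf_scores, threshold=0.5))
```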
How Oversee.net Does It
Compare the two sets of articles (A - domains, B - categories) using
the Jaccard Index:
$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
Categories with highest scores using this index are matched to a
domain
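The matching step can be sketched as follows. The article sets and the `top_n` cutoff are made up for illustration; how many top-scoring categories are actually kept is not stated in the slides.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets have index 0
    return len(a & b) / len(union)

def rank_categories(domain_articles, category_articles, top_n=3):
    """Return the `top_n` (category, score) pairs with the highest index."""
    scored = [
        (name, jaccard(domain_articles, articles))
        for name, articles in category_articles.items()
    ]
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]

# Hypothetical article sets for one domain and three categories.
domain_articles = {"Bank", "Finance", "Credit card"}
category_articles = {
    "banking": {"Bank", "Finance", "Savings account"},
    "credit cards": {"Credit card", "Bank"},
    "games": {"Video game"},
}

print(rank_categories(domain_articles, category_articles, top_n=2))
```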
Our Focus
Domain → Keywords ↔ Keywords ← Category
Critical link: domains to keywords
Improve quality of keywords
Click Through Rate
String Similarity
Semantic Analysis
Keyword | CTR | String Similarity | Semantic Similarity
industrial | 20 | 80 | 0
industriel | 20 | 89 | 0
industrie | 20 | 100 | 0
china manufacturer | 20 | 0 | 88
industries | 20 | 80 | 98
industrial companies | 20 | 0 | 86
Domain Keywords
Focusing on developing the link between domains and keywords, the two
main questions we posed for our research were:
Could we use ESA to extend the number of meaningful keywords per
domain?
Could we use the keywords obtained through Oversee.net's in-house
statistics as the basis of the new keywords?
Methodology
Extending the set of keywords:
When generating new keywords:
Only take top 3 articles
Only take top 2 terms
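One possible reading of the two cutoffs above: for each existing keyword, look up its 3 highest-scoring articles, and from each of those articles take the 2 highest-scoring terms as new keywords. The sketch below follows that reading; the lookup tables, scores, and function names are hypothetical and not taken from the slides.

```python
def top_k(scores, k):
    """Names of the k highest-scoring entries (ties broken alphabetically)."""
    ordered = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [name for name, _ in ordered[:k]]

def expand_keywords(keywords, term_to_articles, article_to_terms,
                    n_articles=3, n_terms=2):
    """Generate new keywords from the top terms of the top articles."""
    new_keywords = []
    for keyword in keywords:
        for article in top_k(term_to_articles.get(keyword, {}), n_articles):
            for term in top_k(article_to_terms.get(article, {}), n_terms):
                if term not in keywords and term not in new_keywords:
                    new_keywords.append(term)
    return new_keywords

# Hypothetical TF-IDF lookups in both directions.
term_to_articles = {
    "train": {"Rail transport": 0.9, "Train station": 0.8,
              "Locomotive": 0.7, "Training": 0.3},
}
article_to_terms = {
    "Rail transport": {"rail": 0.9, "train": 0.8, "track": 0.5},
    "Train station": {"station": 0.9, "platform": 0.6, "train": 0.5},
    "Locomotive": {"engine": 0.9, "steam": 0.7},
    "Training": {"course": 0.9},
}

print(expand_keywords(["train"], term_to_articles, article_to_terms))
```

Keeping both cutoffs small limits how far the generated keywords can drift from the original keyword, at the cost of generating fewer of them.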
Methodology
Method 2 for extending the set of keywords:
Breaking up and correcting the domain name
Example: domain = 'chaselogon.com'
haselogon
aselogon
cha selogon
chas elogon
chase logon
chasel ogon
chaselo gon
chaselog
chaselogo
If entire string matches a word in reference file then stop
If both parts of broken string are exact words then stop
If substring is an exact word then correct other part using edit
distances
Corrections used: deletions, transpositions, replacements, insertions
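The rules above can be sketched roughly as follows. This is an assumption-laden illustration rather than the authors' implementation: the vocabulary is a tiny made-up stand-in for the reference file, only two-part splits are tried, corrections are limited to a single edit, and ties between candidate corrections are broken alphabetically.

```python
import string

def edits1(word):
    """All strings one deletion, transposition, replacement or insertion away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(fragment, vocabulary):
    """Return the fragment if it is a word, else a word one edit away, else None."""
    if fragment in vocabulary:
        return fragment
    candidates = sorted(edits1(fragment) & vocabulary)
    return candidates[0] if candidates else None

def parse_domain(domain, vocabulary):
    """Split a domain name into keywords; `vocabulary` must be a set of words."""
    name = domain.lower().split(".")[0]  # drop the TLD
    if name in vocabulary:               # entire string is a word: stop
        return [name]
    splits = [(name[:i], name[i:]) for i in range(1, len(name))]
    for left, right in splits:           # both parts are exact words: stop
        if left in vocabulary and right in vocabulary:
            return [left, right]
    for left, right in splits:           # one part exact: correct the other
        if left in vocabulary:
            fixed = correct(right, vocabulary)
            if fixed:
                return [left, fixed]
        if right in vocabulary:
            fixed = correct(left, vocabulary)
            if fixed:
                return [fixed, right]
    return []

# Hypothetical stand-in for the reference file.
vocabulary = {"chase", "logon", "cheap", "vacations", "addicting", "games"}

for d in ["chaselogon.com", "cheapvacatins.com", "addictinggamas.com"]:
    print(d, parse_domain(d, vocabulary))
```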
Methodology
Method 2 for extending the set of keywords:
The reference file is made up of collections of text; we added more
information:
Company names
Popular websites
Brand and store names
Countries and major cities
Initial keywords | Keywords after parsing
chameloeon | chas, chase, elson, login
Methodology
Generating new keywords and mapping to categories
bankfianancial.com
ncofinancial
ban
bank
financial
financial institutions
financial centre
lobsters
official personal
societies chairman
. . .
Jaccard Index = 0.240492
finance
retirement pension
debit card
tenant credit check
...
Jaccard Index = 0.348147
credit cards
debit card
credit applications
rewards program
...
Jaccard Index = 0.219457
banking
savings banking
checks
community bank
...
Results: Comparing Their Keywords to Semantic
We were given a sample of 300 domains that had been matched by
hand to a total of 500 categories
 | CTR & String Similarity | CTR, String Similarity & Semantic Analysis
Number of matches | 25 | 309
Percentage of matches | 5% | 61.8%
Results: Generating New Keywords
Using Method 1:
 | CTR & String Similarity | Method 1 | CTR & String Similarity & 7 Random
Number of matches | 25 | 21 | 24
Percentage of matches | 5% | 4.2% | 4.8%
Most of the time, the different methods yielded the same results
Cases where the new keywords improved the system:
thhetrainline.com
Cases where the base case did better:
inindustries.com
Results
thhetrainline.com
thetrainline
Jaccard Index = 0.0001 microcars & city cars
Jaccard Index = 0.0002 property management
thhetrainline.com
thetrainline
strafe train
moving departing
train station
telecommunications
georgia
rain shine
. . .
Jaccard Index = 0.1348 bus & rail
Jaccard Index = 0.2255 libraries & museums
Results
inindustries.com
industrial
industrias
industriel
. . .
Jaccard Index = 0.0786 manufacturing
inindustries.com
industrial
industrias
industriel
. . .
ministry
quarterly garden/outdoor
filipino footballer
. . .
Jaccard Index = 0.099 tourist destinations
Jaccard Index = 0.1326 real estate
Results: Parsing the Domains
Using Method 1 & 2:
 | CTR & String Similarity | Method 1 & 2 | CTR & String Similarity & 15 Random
Number of matches | 25 | 93 | 23
Percentage of matches | 5% | 18.6% | 4.6%
Results - Parsing the Domains
chaselogon.com
chameloeon
No category matched
chaselogon.com
chameloeon
chas
chase
elson
login
password
journalists cyber
logins expensive
beatles
. . .
Jaccard Index = 0.4637 credit cards
Jaccard Index = 0.4637 banking
Results: Parsing the Domains
Using Method 2:
 | CTR & String Sim. | Method 1 & 2 | Method 2
Number of matches | 25 | 97 | 77 out of 356
Percentage of matches | 5% | 19.4% | ∼ 21.6%
Initial results show that overall, just using parsing might be more beneficial
→ depends on the amount of noise.
Example with a lot of noise:
mobilestorage.ca
Example with minimal noise:
addictinggamas.com
Results - Amplification of noise
mobilestorage.ca
gfilestorage
mobileshop
mobile
storage
age
investor
vilest
. . .
Jaccard Index = 0.1011 mobile & wireless
Jaccard Index = 0.0959 music & audio
mobilestorage.ca
gfilestorage
mobileshop
mobile
storage
age
investor
vilest
. . .
legal age
taylor
phone companies
mobil
. . .
Jaccard Index = 0.0942 music & audio
Jaccard Index = 0.0887 education
Results - Minimal noise
addictinggamas.com
addictinggams
addictivegames
adictigegames
. . .
addict
addicting
games
ingram
. . .
Jaccard Index = 0.0153 software
addictinggamas.com
addictinggams
addictivegames
adictigegames
. . .
addict
addicting
games
ingram
. . .
gameplay requires
game
impulsedriven flash
add ons
. . .
Jaccard Index = 0.2019 computer & video games
Jaccard Index = 0.1975 games
Results: Extended Matches
Using Extended Matches:
We extended possible matches to parent and root nodes of the
category tree.
We checked in how many cases the parent or root node of the
categories we obtained matched the manual matching.
 | CTR & String Sim. | Method 1 | Method 1 & 2 | Method 2
Number of matches | 25 | 21 | 97 | 77 out of 356
Percentage of matches | 5% | 4.2% | 19.4% | ∼ 21.6%
Number of extended matches | 32 | 29 | 128 | 102 out of 356
Percentage of extended matches | 6.4% | 5.8% | 25.6% | ∼ 28.7%
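The extended-match check can be illustrated with a small sketch. The tree below is hypothetical (the category names come from earlier slides, but their parent/child arrangement and the "entertainment" root are invented here), and representing the tree as a child-to-parent dictionary is an assumption.

```python
def parent_and_root(category, parent_of):
    """Return (parent, root) of a category in a child -> parent mapping."""
    parent = parent_of.get(category)
    root = category
    while parent_of.get(root) is not None:  # walk up until there is no parent
        root = parent_of[root]
    return parent, root

def is_extended_match(predicted, manual, parent_of):
    """True if the predicted category, its parent, or its root equals the manual one."""
    if predicted == manual:
        return True
    parent, root = parent_and_root(predicted, parent_of)
    return manual in (parent, root)

# Hypothetical fragment of a category tree (child -> parent).
parent_of = {
    "computer & video games": "games",
    "games": "entertainment",
}

print(is_extended_match("computer & video games", "games", parent_of))
```

Because a prediction that is slightly too specific still counts when its parent or root agrees with the manual label, the extended counts can only be equal to or higher than the exact-match counts.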
Conclusion
Implemented a program to match domains with categories
Created an ESA based method to amplify existing keywords
Adapted a domain name parsing and spell correcting method
Revisiting our research questions:
Could we use ESA to extend the number of meaningful keywords per
domain? → Yes
Could we use the keywords obtained through Oversee.net's in-house
statistics as the basis of the new keywords? → No. Or at least
further processing must be done.
getting better & more keywords → getting a few good keywords
Future Directions
Find out how many good initial keywords are required to use our
method successfully
Explore a better way of ranking keywords and determine which are
the most descriptive ones
Click-through rate and string similarity comparisons are not sufficiently
descriptive; a better scoring method is needed
Have a reference of the most popular websites, so that the domains
given could be compared to these
Analyze content in websites to amplify domain to category mapping
Thank you!
Academic Mentor: Cristina Garcia-Cardona
Industry Sponsor: Kryztof Urban and Oversee.net
RIPS Director: Dr. Michael Raugh
Director of IPAM: Dr. Russ Caflisch
IPAM Staff: Dimi, Stacey, Stacy, Roland, Stephanie, and everyone
that made RIPS possible
Questions?
Thank you for listening!

Contenu connexe

Tendances

Part 1
Part 1Part 1
Part 1
butest
 
Coverage-Criteria-for-Testing-SQL-Queries
Coverage-Criteria-for-Testing-SQL-QueriesCoverage-Criteria-for-Testing-SQL-Queries
Coverage-Criteria-for-Testing-SQL-Queries
Mohamed Reda
 

Tendances (13)

Semi Automatic to Improve Ontology Mapping Process in Semantic Web Data Analysis
Semi Automatic to Improve Ontology Mapping Process in Semantic Web Data AnalysisSemi Automatic to Improve Ontology Mapping Process in Semantic Web Data Analysis
Semi Automatic to Improve Ontology Mapping Process in Semantic Web Data Analysis
 
Web Page Recommendation using Domain Knowledge and Web Usage Knowledge
Web Page Recommendation using Domain Knowledge and Web Usage KnowledgeWeb Page Recommendation using Domain Knowledge and Web Usage Knowledge
Web Page Recommendation using Domain Knowledge and Web Usage Knowledge
 
Effective Data Retrieval in XML using TreeMatch Algorithm
Effective Data Retrieval in XML using TreeMatch AlgorithmEffective Data Retrieval in XML using TreeMatch Algorithm
Effective Data Retrieval in XML using TreeMatch Algorithm
 
DSA-Lecture-05
DSA-Lecture-05DSA-Lecture-05
DSA-Lecture-05
 
Intelligent Hiring with Resume Parser and Ranking using Natural Language Proc...
Intelligent Hiring with Resume Parser and Ranking using Natural Language Proc...Intelligent Hiring with Resume Parser and Ranking using Natural Language Proc...
Intelligent Hiring with Resume Parser and Ranking using Natural Language Proc...
 
Rs web context_content__v4.0__20120908_ma
Rs web context_content__v4.0__20120908_maRs web context_content__v4.0__20120908_ma
Rs web context_content__v4.0__20120908_ma
 
IRJET- Missing Value Evaluation in SQL Queries: A Survey
IRJET- 	  Missing Value Evaluation in SQL Queries: A SurveyIRJET- 	  Missing Value Evaluation in SQL Queries: A Survey
IRJET- Missing Value Evaluation in SQL Queries: A Survey
 
Resume parser
Resume parserResume parser
Resume parser
 
104333 sri vidhya eng notes
104333 sri vidhya eng notes104333 sri vidhya eng notes
104333 sri vidhya eng notes
 
Complex Matching of RDF Datatype Properties
Complex Matching of RDF Datatype PropertiesComplex Matching of RDF Datatype Properties
Complex Matching of RDF Datatype Properties
 
Part 1
Part 1Part 1
Part 1
 
Coverage-Criteria-for-Testing-SQL-Queries
Coverage-Criteria-for-Testing-SQL-QueriesCoverage-Criteria-for-Testing-SQL-Queries
Coverage-Criteria-for-Testing-SQL-Queries
 
Positional Data Organization and Compression in Web Inverted Indexes
Positional Data Organization and Compression in Web Inverted IndexesPositional Data Organization and Compression in Web Inverted Indexes
Positional Data Organization and Compression in Web Inverted Indexes
 

En vedette

2014-03-18 US OER Policy Overview for #OERPolicyWorks
2014-03-18 US OER Policy Overview for #OERPolicyWorks2014-03-18 US OER Policy Overview for #OERPolicyWorks
2014-03-18 US OER Policy Overview for #OERPolicyWorks
Nicole Allen
 
Netherlands
NetherlandsNetherlands
Netherlands
Lexi34
 
Edct 203 11b
Edct 203 11bEdct 203 11b
Edct 203 11b
colinissa
 
Lansare.Bolucencova
Lansare.BolucencovaLansare.Bolucencova
Lansare.Bolucencova
Adela Negura
 
Hire the right driver
Hire the right driverHire the right driver
Hire the right driver
Pete DiSantis
 
Desenho Parte Mecânica TID 3
Desenho Parte Mecânica TID 3Desenho Parte Mecânica TID 3
Desenho Parte Mecânica TID 3
Sgtmuniz15
 
Mitologie universala.11.mit. romaneasca
Mitologie universala.11.mit. romaneascaMitologie universala.11.mit. romaneasca
Mitologie universala.11.mit. romaneasca
Adela Negura
 
Contents page analysis
Contents page analysisContents page analysis
Contents page analysis
yumm
 

En vedette (20)

2014-03-18 US OER Policy Overview for #OERPolicyWorks
2014-03-18 US OER Policy Overview for #OERPolicyWorks2014-03-18 US OER Policy Overview for #OERPolicyWorks
2014-03-18 US OER Policy Overview for #OERPolicyWorks
 
2007-10-19 Working With Faculty (SWSLC)
2007-10-19 Working With Faculty (SWSLC)2007-10-19 Working With Faculty (SWSLC)
2007-10-19 Working With Faculty (SWSLC)
 
Tanjaouiates au Rallye Aicha des Gazelles
Tanjaouiates au Rallye Aicha des GazellesTanjaouiates au Rallye Aicha des Gazelles
Tanjaouiates au Rallye Aicha des Gazelles
 
Netherlands
NetherlandsNetherlands
Netherlands
 
Edct 203 11b
Edct 203 11bEdct 203 11b
Edct 203 11b
 
Romanian Design Week 2016
Romanian Design Week 2016Romanian Design Week 2016
Romanian Design Week 2016
 
Intro to n screen-rev
Intro to n screen-revIntro to n screen-rev
Intro to n screen-rev
 
L'Anthropocène et ses victimes - François Gemenne
L'Anthropocène et ses victimes  - François GemenneL'Anthropocène et ses victimes  - François Gemenne
L'Anthropocène et ses victimes - François Gemenne
 
Accommodation
AccommodationAccommodation
Accommodation
 
Lansare.Bolucencova
Lansare.BolucencovaLansare.Bolucencova
Lansare.Bolucencova
 
Hire the right driver
Hire the right driverHire the right driver
Hire the right driver
 
Spectral functions and geometric invariants
Spectral functions and geometric invariantsSpectral functions and geometric invariants
Spectral functions and geometric invariants
 
George Business Consultancy Operating Model
George Business Consultancy Operating ModelGeorge Business Consultancy Operating Model
George Business Consultancy Operating Model
 
As 3R
As 3RAs 3R
As 3R
 
2012-10-24 OER and Solving the Textbook Cost Crisis
2012-10-24 OER and Solving the Textbook Cost Crisis2012-10-24 OER and Solving the Textbook Cost Crisis
2012-10-24 OER and Solving the Textbook Cost Crisis
 
Bitcoin 101
Bitcoin 101Bitcoin 101
Bitcoin 101
 
Desenho Parte Mecânica TID 3
Desenho Parte Mecânica TID 3Desenho Parte Mecânica TID 3
Desenho Parte Mecânica TID 3
 
Mitologie universala.11.mit. romaneasca
Mitologie universala.11.mit. romaneascaMitologie universala.11.mit. romaneasca
Mitologie universala.11.mit. romaneasca
 
Lean nella azienda ed technologia
Lean nella azienda ed technologia Lean nella azienda ed technologia
Lean nella azienda ed technologia
 
Contents page analysis
Contents page analysisContents page analysis
Contents page analysis
 

Similaire à Mapping Domain Names to Categories

DataScience SG | Undergrad Series | 26th Sep 19
DataScience SG | Undergrad Series | 26th Sep 19DataScience SG | Undergrad Series | 26th Sep 19
DataScience SG | Undergrad Series | 26th Sep 19
Yong Siang (Ivan) Tan
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
dannyijwest
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
IJwest
 
Semantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data MiningSemantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data Mining
Editor IJCATR
 
Analyzing the solutions of DEA through information visualization and data min...
Analyzing the solutions of DEA through information visualization and data min...Analyzing the solutions of DEA through information visualization and data min...
Analyzing the solutions of DEA through information visualization and data min...
Gurdal Ertek
 

Similaire à Mapping Domain Names to Categories (20)

Twitter Sentiment Analysis: An Unsupervised Approach
Twitter Sentiment Analysis: An Unsupervised ApproachTwitter Sentiment Analysis: An Unsupervised Approach
Twitter Sentiment Analysis: An Unsupervised Approach
 
Conceptual design & ER Model.pptx
Conceptual design & ER Model.pptxConceptual design & ER Model.pptx
Conceptual design & ER Model.pptx
 
Research @ RELEASeD (presented at SATTOSE2013)
Research @ RELEASeD (presented at SATTOSE2013)Research @ RELEASeD (presented at SATTOSE2013)
Research @ RELEASeD (presented at SATTOSE2013)
 
DataScience SG | Undergrad Series | 26th Sep 19
DataScience SG | Undergrad Series | 26th Sep 19DataScience SG | Undergrad Series | 26th Sep 19
DataScience SG | Undergrad Series | 26th Sep 19
 
EE-232-LEC-01 Data_structures.pptx
EE-232-LEC-01 Data_structures.pptxEE-232-LEC-01 Data_structures.pptx
EE-232-LEC-01 Data_structures.pptx
 
Jarrar: Data Schema Integration
Jarrar: Data Schema IntegrationJarrar: Data Schema Integration
Jarrar: Data Schema Integration
 
Profile Analysis of Users in Data Analytics Domain
Profile Analysis of   Users in Data Analytics DomainProfile Analysis of   Users in Data Analytics Domain
Profile Analysis of Users in Data Analytics Domain
 
PLACEMENTS ANALYTICS AND DASHBOARD
PLACEMENTS ANALYTICS AND DASHBOARDPLACEMENTS ANALYTICS AND DASHBOARD
PLACEMENTS ANALYTICS AND DASHBOARD
 
CS8592_Notes_008_edubuzz360.pdf
CS8592_Notes_008_edubuzz360.pdfCS8592_Notes_008_edubuzz360.pdf
CS8592_Notes_008_edubuzz360.pdf
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED  ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED  ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
 
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKINGINTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING
 
Early Analysis and Debuggin of Linked Open Data Cubes
Early Analysis and Debuggin of Linked Open Data CubesEarly Analysis and Debuggin of Linked Open Data Cubes
Early Analysis and Debuggin of Linked Open Data Cubes
 
Jarrar: Data Schema Integration
Jarrar: Data Schema Integration Jarrar: Data Schema Integration
Jarrar: Data Schema Integration
 
OOAD-Unit1.ppt
OOAD-Unit1.pptOOAD-Unit1.ppt
OOAD-Unit1.ppt
 
Semantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data MiningSemantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data Mining
 
Developing Competitive Strategies in Higher Education through Visual Data Mining
Developing Competitive Strategies in Higher Education through Visual Data MiningDeveloping Competitive Strategies in Higher Education through Visual Data Mining
Developing Competitive Strategies in Higher Education through Visual Data Mining
 
Analyzing the solutions of DEA through information visualization and data min...
Analyzing the solutions of DEA through information visualization and data min...Analyzing the solutions of DEA through information visualization and data min...
Analyzing the solutions of DEA through information visualization and data min...
 
SKOS as a key element in Enterprise Linked Data Strategies
SKOS as a key element in Enterprise Linked Data StrategiesSKOS as a key element in Enterprise Linked Data Strategies
SKOS as a key element in Enterprise Linked Data Strategies
 
Concept Based Search
Concept Based SearchConcept Based Search
Concept Based Search
 

Dernier

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Dernier (20)

Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Mapping Domain Names to Categories

  • 1. Mapping Domain Names to Categories. Maya Rotmensch, Sorcha Gilroy, Corina Gurău. Academic Mentor: Cristina Garcia-Cardona. Industry Sponsor: Oversee.net (Kryztof Urban). Research in Industrial Projects, August 15, 2013. Institute for Pure & Applied Mathematics, University of California, Los Angeles. (Institute of Pure and Applied Mathematics) Mapping Domain Names to Categories August 15, 2013 1 / 41
  • 2. Outline: 1. Oversee.net; 2. Problem Statement (Why so complicated?; ESA - Explicit Semantic Analysis; How Oversee.net Does It); 3. Our Project (Our Focus; Methodology; Results); 4. Concluding Remarks.
  • 3. Outline: 1. Oversee.net; 2. Problem Statement (Why so complicated?; ESA - Explicit Semantic Analysis; How Oversee.net Does It); 3. Our Project (Our Focus; Methodology; Results); 4. Concluding Remarks.
  • 4. Oversee.net’s Business Model. Diagram: Person → Website.
  • 5. Diagram: Person looking for games → A gaming website.
  • 6. Oversee.net’s Business Model. Diagram: Person looking for games → Domain → A gaming website. Direct Navigation: when users navigate to a website by using the address bar instead of a search engine. Example: a person looking for a gaming website navigates to ’addictinggamas.com’.
  • 7. Oversee.net’s Business Model. Domain parking + traffic matching → Oversee.net. Diagram: Person → Domain → Category → Website. Monetized Domain Parking: the registration of internet domain names without placing any content on the domain. Owners monetize traffic by displaying links and advertisements.
  • 8. Oversee.net’s Business Model. Advertisers: partners of Oversee.net. They choose the types of traffic they want from Oversee.net’s category tree.
  • 9. Oversee.net’s Business Model. Parked domains do not have any content, so mapping domains to categories is extremely difficult. Oversee.net uses keywords to describe both domains and categories. Diagram: Domain, Keywords, Keywords, Category. This is not enough, as we are not guaranteed that the two sets of keywords use the same language!
  • 10. Outline: 1. Oversee.net; 2. Problem Statement (Why so complicated?; ESA - Explicit Semantic Analysis; How Oversee.net Does It); 3. Our Project (Our Focus; Methodology; Results); 4. Concluding Remarks.
  • 11. So what’s the big deal? Reasoning about concepts. Scarcity of input information. Example 1, spelling error: cheapvacatins.com. Example 2, ambiguous meaning: bigbearhuts.com (animals? huts? It is supposed to be winter sports).
  • 12. Text Categorization. Our problem can be thought of as a problem of categorization: we need to assign a domain to one or more classes or categories. A natural choice is topic modeling. However, unlike most text categorization problems, we don’t actually have documents to classify, as we are dealing with undeveloped domains.
  • 13. Topic Modeling. This method analyzes the relationships between documents in a corpus by isolating a set of topics from the documents. For meaningful results, one must work with a set of large texts, but our data set consists of keywords, as our domains are undeveloped. This method results in organic generation of topics, but the categories we are attempting to map into are pre-defined.
  • 14. ESA - Explicit Semantic Analysis. Building a Semantic Interpreter: using a Vector Space Model + an exogenous knowledge base → represent the meaning of text [1]. Number of articles ∼ 3.5 million; number of terms ∼ 45 million. [1] Evgeniy Gabrilovich and Shaul Markovitch. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), 2007.
  • 15. ESA - Explicit Semantic Analysis. Example weight matrix (rows are articles, columns are terms):

      | Article  | Government | Finance | Toys | Children | Bank | School | ... |
      |----------|------------|---------|------|----------|------|--------|-----|
      | Law      | 0.2        | 0.3     | 0.8  | 0.9      | 0.2  | 0.7    | ... |
      | Article2 | 0.8        | 0.9     | 0.1  | 0.3      | 0.7  | 0.5    | ... |
      | Article3 | 0.5        | 0.2     | 0.3  | 0.6      | 0.4  | 0.8    | ... |
      | Article4 | 0.1        | 0.2     | 0.1  | 0.3      | 0.4  | 0.2    | ... |
      | ...      | ...        | ...     | ...  | ...      | ...  | ...    | ... |

    Term frequency inverse document frequency: $\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$. Logarithmically scaled term frequency: $\mathrm{tf}(t, d) = \log(f(t, d) + 1)$. Inverse document frequency: $\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$.
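The three formulas on slide 15 can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the toy corpus, the article names, and the choice to give a term that appears in no article a weight of 0 are all assumptions made here.

```python
import math

def tf(term, doc):
    """Logarithmically scaled term frequency: log(f(t, d) + 1)."""
    return math.log(doc.count(term) + 1)

def idf(term, docs):
    """Inverse document frequency: log(|D| / |{d in D : t in d}|)."""
    containing = sum(1 for d in docs if term in d)
    if containing == 0:
        return 0.0  # assumption: a term found in no article gets weight 0
    return math.log(len(docs) / containing)

def tfidf(term, doc, docs):
    """tfidf(t, d, D) = tf(t, d) * idf(t, D)."""
    return tf(term, doc) * idf(term, docs)

# Hypothetical toy corpus: each article is a list of tokens.
articles = {
    "Law": ["government", "law", "court", "government"],
    "Bank": ["bank", "finance", "government"],
    "Toy": ["toys", "children", "school"],
}
docs = list(articles.values())
weights = {name: tfidf("government", doc, docs) for name, doc in articles.items()}
```

In this toy corpus "government" receives its largest weight in the article that repeats it and a weight of zero in the article that never mentions it.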
  • 16. ESA - Explicit Semantic Analysis. Using a Semantic Interpreter. Cosine similarity measure: $\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$.
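As a small worked example of the cosine measure, the sketch below compares term vectors taken from the columns of the example matrix on slide 15 (weights for Law, Article2, Article3, Article4). Treating each term column as its vector over articles is an assumption made for illustration, as is returning 0 for an all-zero vector.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| ||B||) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: similarity with an all-zero vector is 0
    return dot / (norm_a * norm_b)

# Columns of the example matrix: weights over Law, Article2, Article3, Article4.
government = [0.2, 0.8, 0.5, 0.1]
finance = [0.3, 0.9, 0.2, 0.2]
toys = [0.8, 0.1, 0.3, 0.1]

gov_fin = cosine_similarity(government, finance)
gov_toys = cosine_similarity(government, toys)
```

With these example weights, "Government" comes out closer to "Finance" than to "Toys", which is the kind of relatedness signal the semantic interpreter is meant to provide.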
  • 17. How Oversee.net Does It. Instead of comparing two texts, compare two small sets of words! Use keywords to describe domains and categories. Represent these keywords in terms of DBpedia articles. A keyword is significantly related to an article if the TF-IDF is above a certain threshold. The set of articles associated to a domain/category is the union of the sets of articles associated to its keywords.
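The thresholding and union steps described on slide 17 might look like the sketch below. The score table, the article names, and the threshold value are hypothetical; the slides do not state the threshold Oversee.net uses.

```python
def articles_for_keyword(keyword, tfidf_scores, threshold):
    """Articles whose TF-IDF weight for this keyword is above the threshold."""
    scores = tfidf_scores.get(keyword, {})
    return {article for article, score in scores.items() if score > threshold}

def articles_for_keywords(keywords, tfidf_scores, threshold):
    """Union of the article sets of all keywords of a domain or category."""
    result = set()
    for keyword in keywords:
        result |= articles_for_keyword(keyword, tfidf_scores, threshold)
    return result

# Hypothetical keyword -> {article: TF-IDF weight} table.
tfidf_scores = {
    "bank": {"Bank": 0.9, "River": 0.4, "Loan": 0.6},
    "financial": {"Finance": 0.8, "Bank": 0.7, "Toy": 0.05},
}
domain_articles = articles_for_keywords(["bank", "financial"], tfidf_scores, threshold=0.5)
```

Here the weakly related articles ("River", "Toy") fall below the assumed threshold of 0.5 and are left out of the domain's article set.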
  • 18. How Oversee.net Does It. Compare the two sets of articles (A for domains, B for categories) using the Jaccard Index: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$. Categories with the highest scores using this index are matched to a domain.
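A minimal sketch of the Jaccard comparison and the resulting ranking of categories follows. The article sets and category names are invented for illustration, and the tie-breaking rule (alphabetical by category name) and the value returned for two empty sets are assumptions.

```python
def jaccard(a, b):
    """J(A, B) = |A intersect B| / |A union B|."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets have similarity 0
    return len(a & b) / len(union)

def rank_categories(domain_articles, category_articles, top_n=3):
    """Return the top_n (category, score) pairs, highest Jaccard Index first."""
    scored = [(name, jaccard(domain_articles, articles))
              for name, articles in category_articles.items()]
    # Sort by descending score; ties are broken alphabetically (an assumption).
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]

# Hypothetical article sets.
domain_articles = {"Bank", "Loan", "Credit card"}
category_articles = {
    "banking": {"Bank", "Loan", "Savings account"},
    "credit cards": {"Credit card", "Debit card"},
    "toys": {"Toy"},
}
ranking = rank_categories(domain_articles, category_articles)
```

Because the index divides by the size of the union, adding many unrelated articles to a domain's set lowers every score, which is one way noisy keywords can hurt the matching.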
  • 19. Outline: 1. Oversee.net; 2. Problem Statement (Why so complicated?; ESA - Explicit Semantic Analysis; How Oversee.net Does It); 3. Our Project (Our Focus; Methodology; Results); 4. Concluding Remarks.
  • 20. Our Focus. Diagram: Domain, Keywords, Keywords, Category. Critical link: domains to keywords. Improve quality of keywords using Click Through Rate, String Similarity, and Semantic Analysis.

      | Keyword              | CTR | String Similarity | Semantic Similarity |
      |----------------------|-----|-------------------|---------------------|
      | industrial           | 20  | 80                | 0                   |
      | industriel           | 20  | 89                | 0                   |
      | industrie            | 20  | 100               | 0                   |
      | china manufacturer   | 20  | 0                 | 88                  |
      | industries           | 20  | 80                | 98                  |
      | industrial companies | 20  | 0                 | 86                  |
  • 21. Domain Keywords. Focusing on developing the link between domains and keywords, we posed two main research questions: Could we use ESA to extend the number of meaningful keywords per domain? Could we use the keywords obtained through Oversee.net’s in-house statistics as the basis of the new keywords?
  • 22. Methodology. Extending the set of keywords (diagram not recoverable from the transcript).
  • 23. Methodology. Extending the set of keywords. When generating new keywords: only take the top 3 articles; only take the top 2 terms.
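One way to read the "top 3 articles, top 2 terms" rule on slide 23 is sketched below: for each existing keyword, keep its three highest-weighted articles, and from each of those articles keep its two highest-weighted terms as new keywords. The slide's diagram did not survive in the transcript, so this reading, the function name `expand_keywords`, and both lookup tables are assumptions rather than the authors' implementation.

```python
def expand_keywords(keywords, keyword_to_articles, article_to_terms,
                    top_articles=3, top_terms=2):
    """Generate new keywords from the best articles of each existing keyword."""
    new_keywords = []
    for keyword in keywords:
        # Highest-weighted articles for this keyword.
        articles = sorted(keyword_to_articles.get(keyword, {}).items(),
                          key=lambda kv: kv[1], reverse=True)[:top_articles]
        for article, _ in articles:
            # Highest-weighted terms of that article.
            terms = sorted(article_to_terms.get(article, {}).items(),
                           key=lambda kv: kv[1], reverse=True)[:top_terms]
            for term, _ in terms:
                if term not in keywords and term not in new_keywords:
                    new_keywords.append(term)
    return new_keywords

# Hypothetical weights, for illustration only.
keyword_to_articles = {
    "train": {"Train station": 0.9, "Rail transport": 0.8,
              "Locomotive": 0.7, "Rain": 0.1},
}
article_to_terms = {
    "Train station": {"station": 0.9, "train": 0.8, "platform": 0.5},
    "Rail transport": {"rail": 0.9, "transport": 0.7, "train": 0.6},
    "Locomotive": {"locomotive": 0.9, "engine": 0.6},
    "Rain": {"rain": 0.9},
}
generated = expand_keywords(["train"], keyword_to_articles, article_to_terms)
```

Capping both the number of articles and the number of terms keeps the expansion small, which limits how much noise a single weak keyword can introduce.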
  • 24. Methodology. Method 2 for extending the set of keywords: breaking up and correcting the domain name. Example: domain = ’chaselogon.com’. Candidate splits shown: haselogon; aselogon; cha + selogon; chas + elogon; chase + logon; chasel + ogon; chaselo + gon; chaselog; chaselogo. Rules: if the entire string matches a word in the reference file, then stop. If both parts of the broken string are exact words, then stop. If a substring is an exact word, then correct the other part using edit distances. Corrections used: deletions, transpositions, replacements, insertions.
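The splitting rules on slide 24 can be sketched as follows. This is an illustrative reading, not the project's code: the reference word set is a tiny hypothetical stand-in for the reference file, corrections are limited to a single edit (the slide only says "edit distances"), and checking every split for two exact words before attempting any correction is an assumption about the order of the rules.

```python
import string

def splits(name):
    """All ways of breaking a string into two non-empty parts."""
    return [(name[:i], name[i:]) for i in range(1, len(name))]

def edits1(word):
    """Strings one deletion, transposition, replacement or insertion away."""
    letters = string.ascii_lowercase
    pairs = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in pairs if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in pairs if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in pairs if r for c in letters]
    inserts = [l + c + r for l, r in pairs for c in letters]
    return set(deletes + transposes + replaces + inserts)

def parse_domain(domain, reference):
    """Extract keywords from a domain name using the three rules."""
    name = domain.split(".")[0].lower()
    if name in reference:                      # rule 1: whole string is a word
        return [name]
    candidates = splits(name)
    for left, right in candidates:             # rule 2: both parts are words
        if left in reference and right in reference:
            return [left, right]
    keywords = []
    for left, right in candidates:             # rule 3: correct the other part
        for known, other in ((left, right), (right, left)):
            if known in reference:
                fixes = sorted(edits1(other) & reference)
                for word in [known] + fixes:
                    if word not in keywords:
                        keywords.append(word)
    return keywords

# Hypothetical reference vocabulary.
reference = {"chase", "logon", "login", "bank", "financial", "addicting", "games"}
```

For example, `parse_domain("chaselogon.com", reference)` stops at rule 2 with `["chase", "logon"]`, while `parse_domain("bankfianancial.com", reference)` falls through to rule 3 and repairs "fianancial" to "financial" by a single deletion.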
  • 25. Methodology. Method 2 for extending the set of keywords. The reference file is made up of collections of text, to which we added more information: company names; popular websites; brand and store names; countries and major cities. Example: initial keywords: chameloeon. Keywords after parsing: chas, chase, elson, login.
  • 26. Methodology. Generating new keywords and mapping to categories. Example: bankfianancial.com. Keywords: ncofinancial ban bank financial financial institutions financial centre lobsters official personal societies chairman . . . Matched categories: finance (Jaccard Index = 0.240492; category keywords: retirement pension debit card tenant credit check ...); credit cards (Jaccard Index = 0.348147; category keywords: debit card credit applications rewards program ...); banking (Jaccard Index = 0.219457; category keywords: savings banking checks community bank ...).
  • 27. Results: Comparing Their Keywords to Semantic. We were given a sample of 300 domains that had been matched by hand to a total of 500 categories.

      |                     | CTR & String Similarity | CTR, String Similarity & Semantic Analysis |
      |---------------------|-------------------------|--------------------------------------------|
      | Number of matches   | 25                      | 309                                        |
      | Percentage of match | 5%                      | 61.8%                                      |
  • 28. Results: Generating New Keywords. Using Method 1:

      |                     | CTR & String Similarity | Method 1 | CTR & String Similarity & 7 Random |
      |---------------------|-------------------------|----------|------------------------------------|
      | Number of matches   | 25                      | 21       | 24                                 |
      | Percentage of match | 5%                      | 4.2%     | 4.8%                               |

    Most of the time, the different methods yielded the same results. Case where the new keywords improved the system: thhetrainline.com. Case where the base case did better: inindustries.com.
  • 29. Results: thhetrainline.com. Base keyword (thetrainline): microcars & city cars (Jaccard Index = 0.0001); property management (Jaccard Index = 0.0002). With new keywords (thetrainline strafe train moving departing train station telecommunications georgia rain shine . . .): bus & rail (Jaccard Index = 0.1348); libraries & museums (Jaccard Index = 0.2255).
  • 30. Results: inindustries.com. Base keywords (industrial industrias industriel . . .): manufacturing (Jaccard Index = 0.0786). With new keywords (industrial industrias industriel . . . ministry quarterly garden/outdoor filipino footballer . . .): tourist destinations (Jaccard Index = 0.099); real estate (Jaccard Index = 0.1326).
  • 31. Results: Parsing the Domains. Using Method 1 & 2:

      |                     | CTR & String Similarity | Method 1 & 2 | CTR & String Similarity & 15 Random |
      |---------------------|-------------------------|--------------|-------------------------------------|
      | Number of matches   | 25                      | 93           | 23                                  |
      | Percentage of match | 5%                      | 18.6%        | 4.6%                                |
  • 32. Results - Parsing the Domains: chaselogon.com. Initial keyword (chameloeon): no category matched. After parsing and generating new keywords (chameloeon chas chase elson login password journalists cyber logins expensive beatles . . .): credit cards (Jaccard Index = 0.4637); banking (Jaccard Index = 0.4637).
  • 33. Results: Parsing the Domains. Using Method 2:

      |                     | CTR & String Sim. | Method 1 & 2 | Method 2      |
      |---------------------|-------------------|--------------|---------------|
      | Number of matches   | 25                | 97           | 77 out of 356 |
      | Percentage of match | 5%                | 19.4%        | ∼ 21.6%       |

    Initial results show that, overall, just using parsing might be more beneficial; this depends on the amount of noise. Example with a lot of noise: mobilestorage.ca. Example with minimal noise: addictinggamas.com.
  • 34. Results - Amplification of noise: mobilestorage.ca. Keywords (gfilestorage mobileshop mobile storage age investor vilest . . .): mobile & wireless (Jaccard Index = 0.1011); music & audio (Jaccard Index = 0.0959). With additional generated keywords (gfilestorage mobileshop mobile storage age investor vilest . . . legal age taylor phone companies mobil . . .): music & audio (Jaccard Index = 0.0942); education (Jaccard Index = 0.0887).
  • 35. Results - Minimal noise: addictinggamas.com. Keywords (addictinggams addictivegames adictigegames . . . addict addicting games ingram . . .): software (Jaccard Index = 0.0153). With additional generated keywords (addictinggams addictivegames adictigegames . . . addict addicting games ingram . . . gameplay requires game impulsedriven flash add ons . . .): computer & video games (Jaccard Index = 0.2019); games (Jaccard Index = 0.1975).
  • 36. Results: Extended Matches. We extended possible matches to the parent and root nodes of the category tree, and checked in how many cases the parent or root node of a category we obtained matched the manual matching.

      |                               | CTR & String Sim. | Method 1 | Method 1 & 2 | Method 2       |
      |-------------------------------|-------------------|----------|--------------|----------------|
      | Number of matches             | 25                | 21       | 97           | 77 out of 356  |
      | Percentage of match           | 5%                | 4.2%     | 19.4%        | ∼ 21.6%        |
      | Number of extended matches    | 32                | 29       | 128          | 102 out of 356 |
      | Percentage of extended match  | 6.4%              | 5.8%     | 25.6%        | ∼ 28.7%        |
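The extended-match check on slide 36 might be sketched as below: a predicted category counts as an extended match if the category itself, its parent, or the root of its branch appears in the manual matching. The tree, the category names in it, and the helper names are hypothetical; `match_rate` simply reproduces the percentage arithmetic of the results tables.

```python
def ancestors(category, parent):
    """Chain of ancestors from the immediate parent up to the root."""
    chain = []
    while parent.get(category) is not None:
        category = parent[category]
        chain.append(category)
    return chain

def is_extended_match(predicted, manual, parent):
    """True if the category, its parent, or its root is in the manual matching."""
    chain = ancestors(predicted, parent)
    candidates = {predicted}
    if chain:
        candidates.add(chain[0])   # parent node
        candidates.add(chain[-1])  # root node
    return bool(candidates & set(manual))

def match_rate(matches, total):
    """Percentage of matches, rounded to one decimal place."""
    return round(100 * matches / total, 1)

# Hypothetical category tree given as child -> parent (None marks a root).
parent = {
    "games": None,
    "computer & video games": "games",
    "arcade games": "computer & video games",
    "finance": None,
    "banking": "finance",
}
```

With this toy tree, `is_extended_match("arcade games", ["games"], parent)` is true through the root node, and `match_rate(128, 500)` and `match_rate(102, 356)` give 25.6 and 28.7, in line with the extended-match percentages reported for Method 1 & 2 and Method 2.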
  • 37. Outline: 1. Oversee.net; 2. Problem Statement (Why so complicated?; ESA - Explicit Semantic Analysis; How Oversee.net Does It); 3. Our Project (Our Focus; Methodology; Results); 4. Concluding Remarks.
  • 38. Conclusion. Implemented a program to match domains with categories. Created an ESA-based method to amplify existing keywords. Adapted a domain name parsing and spell-correcting method. Revisiting our research questions: Could we use ESA to extend the number of meaningful keywords per domain? → Yes. Could we use the keywords obtained through Oversee.net’s in-house statistics as the basis of the new keywords? → No, or at least further processing must be done. Shift in focus: getting better & more keywords → getting a few good keywords.
  • 39. Future Directions. Find out how many good initial keywords are required to use our method successfully. Explore a better way of ranking keywords and determine which are the most descriptive ones. Click-through rate and string similarity comparisons are not sufficiently descriptive; a better scoring method is needed. Keep a reference of the most popular websites, so that the given domains can be compared to these. Analyze content in websites to amplify the domain-to-category mapping.
  • 40. Thank you! Academic Mentor: Cristina Garcia-Cardona. Industry Sponsor: Kryztof Urban and Oversee.net. RIPS Director: Dr. Michael Raugh. Director of IPAM: Dr. Russ Caflisch. IPAM Staff: Dimi, Stacey, Stacy, Roland, Stephanie, and everyone that made RIPS possible.
  • 41. Questions? Thank you for listening!