Similarity Measurement: Folksonomy vs. LSA (Preliminary Results)
The tripartite structure of tagging. A folksonomy is a set of triples <user, tag, object>. Formally, a folksonomy is a tuple F := (U, T, R, Y), where U, T, and R are finite sets whose elements are called users, tags, and resources, and Y ⊆ U × T × R is a ternary relation among users, tags, and resources.
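The tuple definition can be made concrete with a small sketch. The data below is invented for illustration, and the sets U, T, and R are derived from Y for brevity, whereas the formal definition treats them as given.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """One element of the ternary relation Y."""
    user: str
    tag: str
    resource: str


# Hypothetical toy data; not taken from the crawled corpus.
Y = {
    Annotation("alice", "linguistics", "http://example.org/a"),
    Annotation("alice", "language", "http://example.org/a"),
    Annotation("bob", "linguistics", "http://example.org/a"),
    Annotation("bob", "python", "http://example.org/b"),
}

# The three finite sets, obtained here as projections of Y.
U = {a.user for a in Y}
T = {a.tag for a in Y}
R = {a.resource for a in Y}

folksonomy = (U, T, R, Y)
```

Storing Y as a set of triples keeps the user dimension available, which the co-occurrence definition used later depends on.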
Del.icio.us tag distribution. [Figures: tag distribution; log-log tag distribution] Crawling delicious.com yielded 7,528,528 tags (tokens), comprising 188,964 types. All tags were stemmed with the Porter stemmer, which reduced the number of distinct stemmed tags to 174,887.
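The token/type distinction and the effect of stemming can be sketched as follows. The tag list is invented, and `toy_stem` is a hypothetical stand-in that only lowercases and strips a trailing "s"; it is not the Porter stemmer used in the study.

```python
from collections import Counter

# Hypothetical tag assignments, one entry per tag use (token).
tag_tokens = ["linguistics", "language", "languages", "python", "linguistics", "Language"]


def toy_stem(tag: str) -> str:
    """Stand-in for a real stemmer: lowercase and strip one trailing 's'."""
    tag = tag.lower()
    return tag[:-1] if tag.endswith("s") and len(tag) > 3 else tag


type_counts = Counter(tag_tokens)
stem_counts = Counter(toy_stem(t) for t in tag_tokens)

n_tokens = len(tag_tokens)          # every tag use counts
n_types = len(type_counts)          # distinct surface forms
n_stemmed_types = len(stem_counts)  # distinct forms after stemming

# Rank-frequency pairs; plotting these on log-log axes gives the tag distribution.
rank_freq = [
    (rank, freq)
    for rank, (_, freq) in enumerate(stem_counts.most_common(), start=1)
]
```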
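LSA processing workflow in R:
    library(lsa)
    tm = textmatrix("dir/")                  # term-by-document matrix from the files in dir/
    tm = lw_logtf(tm) * gw_idf(tm)           # log term frequency x inverse document frequency
    space = lsa(tm, dims=dimcalc_share())    # truncated SVD; dimcalc_share() chooses the number of dimensions
    as.textmatrix(space)                     # map the LSA space back to a term-by-document matrix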
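The weighting step of this workflow can be illustrated in plain Python. This is a sketch under assumptions: the local weight is taken to be log(tf + 1) and the global weight log2(N / df) + 1, which should be checked against the lsa package documentation. The counts are invented, and the SVD step is not shown.

```python
import math

# Hypothetical raw term-by-document counts: rows are terms, columns are documents.
counts = {
    "tag":   [2, 0, 1],
    "word":  [1, 1, 1],
    "graph": [0, 3, 0],
}
n_docs = 3


def log_tf(tf: int) -> float:
    """Local weight (assumed form): dampens raw term frequency."""
    return math.log(tf + 1)


def idf(row: list) -> float:
    """Global weight (assumed form): terms found in fewer documents score higher."""
    df = sum(1 for tf in row if tf > 0)
    return math.log2(n_docs / df) + 1


# Each cell is the local weight of the count times the global weight of its term.
weighted = {
    term: [log_tf(tf) * idf(row) for tf in row]
    for term, row in counts.items()
}
```

A term that occurs in every document, such as "word" here, gets a global weight of 1, so only its dampened frequency remains.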
LSA corpus preparation. A total of 17,085 web pages were crawled and then parsed to remove all HTML markup. After stemming and stop-word removal, the processed corpus contained 14,993,620 tokens and 259,464 word types. Only words with a frequency above 100 were entered into the word-by-document matrix; there were 1047 such words. Documents kept in the corpus range in length from 51 to 4009 words. The resulting term-document matrix has 3465 columns (documents) and 1082 rows (words).
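The frequency cutoff and the construction of the word-by-document matrix can be sketched as below. The documents are invented and assumed to be already stemmed and stop-word filtered, and the threshold is lowered from 100 to 1 so that the toy corpus keeps some words.

```python
from collections import Counter

# Hypothetical pre-processed documents (already stemmed, stop words removed).
docs = [
    ["tag", "user", "tag", "resourc"],
    ["user", "tag", "semant"],
    ["semant", "space", "tag"],
]
MIN_FREQ = 1  # the slides use 100; lowered for this toy corpus

# Corpus-wide frequency of each word.
corpus_freq = Counter(word for doc in docs for word in doc)

# Keep only words whose frequency is strictly above the threshold.
vocab = sorted(w for w, f in corpus_freq.items() if f > MIN_FREQ)

# Rows are words, columns are documents, cells are raw counts.
matrix = [[doc.count(word) for doc in docs] for word in vocab]
```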
LSA document length distribution. Documents kept in the corpus range in length from 51 to 4009 words.
Three similarity measurements: (1) tag co-occurrence counts, (2) tag vector cosine similarity, (3) LSA.
Similarity measurement 1: Tag co-occurrence counts.
    1) Simple count: how many times two tags are used by the same user to annotate the same resource.
    2) Normalized count (Jaccard index): the co-occurrence count of tag A and tag B divided by the number of annotations that carry A or B (their joint frequency).
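Both counts can be sketched on a handful of invented triples. The sketch assumes a "post" is one (user, resource) pair and that the Jaccard denominator is the number of posts carrying either tag.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (user, tag, resource) triples.
triples = [
    ("alice", "linguistics", "r1"),
    ("alice", "language", "r1"),
    ("bob", "linguistics", "r1"),
    ("bob", "language", "r1"),
    ("bob", "grammar", "r1"),
    ("carol", "linguistics", "r2"),
]

# Group tags by post, i.e. by (user, resource) pair.
posts = defaultdict(set)
for user, tag, resource in triples:
    posts[(user, resource)].add(tag)

tag_freq = defaultdict(int)  # number of posts carrying each tag
cooc = defaultdict(int)      # number of posts carrying both tags of a pair
for tags in posts.values():
    for tag in tags:
        tag_freq[tag] += 1
    for a, b in combinations(sorted(tags), 2):
        cooc[(a, b)] += 1


def jaccard(a: str, b: str) -> float:
    """Co-occurrence count divided by the number of posts with either tag."""
    a, b = sorted((a, b))
    both = cooc.get((a, b), 0)
    either = tag_freq[a] + tag_freq[b] - both
    return both / either if either else 0.0
```

Sorting each pair before storing it keeps the table symmetric, so `jaccard("language", "linguistics")` and `jaccard("linguistics", "language")` agree.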
Distribution of Tag co-occurrence Counts (simple counts)
Distribution of Tag co-occurrence Counts (normalized)
Measurement 2: Cosine similarity. Based on the co-occurrence vector of each tag with every other tag. Since there are both normalized and unnormalized tag co-occurrence counts, we ended up with two cosine similarity measures. For the co-occurrence vectors X and Y of two distinct tags: cos(X, Y) = (X · Y) / (‖X‖ ‖Y‖).
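A minimal sketch of the cosine measure over co-occurrence vectors follows. The matrix is invented; in the study each vector would have one entry per tag in the vocabulary.

```python
import math

tags = ["grammar", "language", "linguistics", "python"]
# Hypothetical symmetric co-occurrence matrix; row i is the vector for tags[i].
cooc = [
    [0, 1, 1, 0],
    [1, 0, 2, 0],
    [1, 2, 0, 1],
    [0, 0, 1, 0],
]


def cosine(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


# Similarity between "grammar" and "language".
sim = cosine(cooc[0], cooc[1])
```

Feeding the same function rows of the normalized (Jaccard) matrix instead of raw counts gives the second cosine measure.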
Distribution of Tag Cosine Similarity
Distribution of Tag Cosine Similarity based on normalized Tag Co-occurrence counts
Results. Pairwise Pearson and Spearman correlations among the five measurements: tag co-occurrence count, tag cosine similarity, LSA, normalized tag co-occurrence count, and tag cosine similarity based on the normalized tag-tag co-occurrence matrix.
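The two correlation coefficients can be sketched in plain Python. The score lists are invented; in the study each list would hold one measurement's similarity scores for the same sequence of tag pairs.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores from two measurements for the same five tag pairs.
cooc_scores = [1, 2, 3, 4, 5]
cosine_scores = [1, 4, 9, 16, 25]
```

Spearman's coefficient depends only on the ordering of the scores, which makes it useful when measurements such as raw counts and cosine values sit on very different scales.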
Correlation
Qualitative insight: "linguistics". Top 10 words related to "linguistics" according to the five measurements.
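Retrieving the most related words under a single measurement amounts to a top-k selection by score. The scores below are invented and do not reproduce the study's results.

```python
import heapq

# Hypothetical similarity of other tags to "linguistics" under one measurement.
scores = {
    "language": 0.82,
    "grammar": 0.74,
    "python": 0.05,
    "syntax": 0.69,
    "semantics": 0.77,
}


def top_k(scores, k=10):
    """Return the k (tag, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


top3 = [tag for tag, _ in top_k(scores, k=3)]
```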
Correlation between two measurements: normalized tag co-occurrence counts vs. normalized tag cosine similarity. P = 0.7547

Editor's notes

  1. Spearman’s rank correlation