Hubness-Based Fuzzy Measures for High-Dimensional k-Nearest Neighbor Classification
Nenad Tomašev, Miloš Radovanović, Dunja Mladenić, Mirjana Ivanović
Presentation outline
•   The phenomenon of
    hubness
•   Why it matters: a
    motivating example
•   Types of hubness
•   Exploiting hubness
    information in kNN:
    hubness-based fuzzy
    measures
•   Anti-hubs: a problem?
•   Approximative
    approaches
•   Experimental evaluation
•   Conclusions and future
    work
Hubness
 One consequence of the well-known
  dimensionality curse
 Influential points emerge in nearest-
  neighbor methods: HUBS
 These hubs appear in many k-neighbor
  sets
What was that song again?
 Hubness: first noticed in music
  collection mining
 Some songs were being retrieved
  (as nearest neighbors) much more
  often than other songs
 This did not, however, reflect the
  perceived similarity between the
  songs
Hubness: definitions
   Hubs: points which appear often as neighbors
     Influential points
     Rare among the data


   Anti-hubs: points which nearly never appear as neighbors
     Possible outliers
     Common among the data


   k-occurrence: a point occurs in a k-neighbor set


   Nk(x): the number of k-occurrences of point x
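
The definitions above can be made concrete with a short sketch. It assumes Euclidean distance and brute-force neighbor search; the helper names `knn_indices` and `k_occurrences` are illustrative and do not come from the slides.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors of every row of X (self excluded)."""
    # Squared Euclidean distances between all pairs of points.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]

def k_occurrences(X, k):
    """N_k(x): the number of k-neighbor sets in which each point appears."""
    neighbors = knn_indices(X, k)
    return np.bincount(neighbors.ravel(), minlength=len(X))

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
n_k = k_occurrences(X, k=5)
print("total k-occurrences:", n_k.sum())
print("largest N_k (strongest hub):", n_k.max())
print("points with N_k = 0 (anti-hubs):", int((n_k == 0).sum()))
```

Every point contributes exactly k neighbors, so the counts always sum to n·k; hubness shows up in how unevenly that total is spread across points.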
k-occurrence distribution
(graphs taken from: Radovanović, Nanopoulos,
Ivanović - Hubs in Space: Popular Nearest
Neighbors in High-Dimensional Data)
What causes hubness
 At first it was thought that some data
  distributions or some metrics might be the
  underlying cause
 The truth is much simpler than that:
  hubness is present in almost any inherently
  high-dimensional data
High dimensional data
 Image data, video, audio,
  measurement streams,
  medical records, text, …
 Modern machine learning
  challenges are all of an
  inherently high dimensional
  nature
The curse of dimensionality
   Everything is sparse
     The requirements for proper density estimates
      rise exponentially with dimensionality
     The notions of structure and 'shape' of clusters
      are much less meaningful, since there is not
      enough data to capture these higher-order
      dependencies
   Concentration of distances
     The relative contrast decreases
     Distance expectation increases, but the
      variance remains constant
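
A small experiment can illustrate the concentration of distances. This is a sketch under stated assumptions: points are drawn uniformly from the unit hypercube, distances are Euclidean, and relative contrast is taken as (max - min) / min of the distances from one query point. None of these choices come from the slides.

```python
import numpy as np

def relative_contrast(dim, n_points=500, seed=0):
    """Relative contrast, mean and standard deviation of the distances
    from one random query to n_points random points in [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)
    d = np.linalg.norm(X - query, axis=1)
    return (d.max() - d.min()) / d.min(), d.mean(), d.std()

for dim in (2, 10, 100, 1000):
    contrast, mean, std = relative_contrast(dim)
    print(f"d={dim:5d}  contrast={contrast:9.3f}  mean={mean:7.3f}  std={std:.3f}")
```

As the dimensionality grows, the mean distance keeps increasing while its spread stays of the same order, so the nearest and farthest points become hard to tell apart.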
Related work
   Work by Miloš Radovanović et al.
       The general properties of the hubness
        phenomenon
       Hubness-weighted kNN
       Hubness-based outlier detection
       Hubs and anti-hubs in SVM
       Time series classification
       …
   Work by Krisztian Buza et al.
       Instance selection / data reduction based on
        hubness
       Time series classification by using hubness
Hubness-weighted kNN: the
idea


 Good hubness: GNk(x)
 Bad hubness: BNk(x)
 Total hubness: Nk(x)
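
The split into good and bad hubness can be sketched as follows, assuming that an occurrence of x in the k-neighbor set of a point is "good" when the two labels match and "bad" otherwise. The neighbor lists and labels are a hand-made toy example, and the function name is illustrative.

```python
import numpy as np

def good_bad_hubness(neighbors, y):
    """Return (GN_k, BN_k) for every point.

    neighbors[i] holds the indices of the k nearest neighbors of point i,
    y[i] is the label of point i.
    """
    n = len(y)
    # match[i, j] is True when the j-th neighbor of point i shares i's label.
    match = y[neighbors] == y[:, None]
    good = np.bincount(neighbors[match], minlength=n)
    bad = np.bincount(neighbors[~match], minlength=n)
    return good, bad

# Toy data: 4 points, k = 2.
neighbors = np.array([[1, 2], [0, 2], [1, 3], [2, 1]])
y = np.array([0, 0, 1, 1])
good, bad = good_bad_hubness(neighbors, y)
print("GN_k:", good)
print("BN_k:", bad)
print("N_k: ", good + bad)
```

The two counts add up to the total hubness, Nk(x) = GNk(x) + BNk(x).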
Hubness-weighted kNN: the
weights

   A simple, yet effective instance-specific weighting
    scheme




   This was the second baseline (the first was kNN) in the
    experiments
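
The weighting formula shown on this slide did not survive in the text. The sketch below assumes the scheme usually associated with hubness-weighted kNN, in which each training point is weighted by exp(-h_b(x)), with h_b(x) the standardized bad hubness, (BNk(x) - mean) / standard deviation. Treat both the formula and the function names as assumptions rather than a transcription of the slide.

```python
import numpy as np

def hubness_weights(bad):
    """Assumed weighting: w(x) = exp(-(BN_k(x) - mean) / std)."""
    bad = np.asarray(bad, dtype=float)
    std = bad.std()
    if std == 0:
        # All points have the same bad hubness: fall back to uniform weights.
        return np.ones_like(bad)
    return np.exp(-(bad - bad.mean()) / std)

def weighted_vote(neighbor_ids, y, weights, n_classes):
    """Predict the class with the largest sum of neighbor weights."""
    votes = np.bincount(y[neighbor_ids], weights=weights[neighbor_ids],
                        minlength=n_classes)
    return int(np.argmax(votes))

# Toy bad-hubness counts and labels, for illustration only.
bad = np.array([0, 2, 2, 0])
y = np.array([0, 0, 1, 1])
w = hubness_weights(bad)
print("weights:", np.round(w, 3))
print("vote of neighbors 0, 1, 2:", weighted_vote(np.array([0, 1, 2]), y, w, 2))
```

Points with above-average bad hubness receive a weight below 1, so they influence the vote less than points that rarely mislead their neighbors.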
How bad can bad hubness
become?
With devastating results
Hubness-based fuzzy measures
 The idea: explore the structure of bad
  hubness
 Introducing the concept of class hubness
 In other words:
    There is nothing inherently good or bad
     about a k-occurrence; as a random event,
     it simply carries some information about
     the label of the point of interest
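
One way to read "class hubness" is as a per-class breakdown of Nk(x): for each class c, count how often x occurs in the k-neighbor sets of points labeled c. The sketch below follows that reading; the function name and the toy data are illustrative assumptions.

```python
import numpy as np

def class_hubness(neighbors, y, n_classes):
    """counts[x, c] = number of k-occurrences of x in neighbor sets of
    points whose label is c."""
    n = len(y)
    counts = np.zeros((n, n_classes), dtype=int)
    # Row i of `neighbors` belongs to point i, whose label is y[i].
    owner_labels = np.repeat(y, neighbors.shape[1])
    np.add.at(counts, (neighbors.ravel(), owner_labels), 1)
    return counts

# Toy data: 4 points, k = 2.
neighbors = np.array([[1, 2], [0, 2], [1, 3], [2, 1]])
y = np.array([0, 0, 1, 1])
counts = class_hubness(neighbors, y, n_classes=2)
print(counts)
print("row sums equal N_k:", counts.sum(axis=1))
```

Each row sums to the total hubness of that point, so good and bad hubness are just two particular ways of grouping these per-class counts.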
The fuzzy k-nearest neighbor
framework
 Each neighbor distributes its vote across
  all the categories
 Votes can also be weighted by distance
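
A minimal sketch of fuzzy k-nearest neighbor voting follows. It assumes the commonly used inverse-distance weighting with exponent 2 / (m - 1), where m is a fuzzifier parameter; the slides do not specify the weighting, so the exact form is an assumption, as are the membership values in the example.

```python
import numpy as np

def fuzzy_knn_vote(memberships, distances=None, m=2.0, eps=1e-12):
    """Combine the fuzzy votes of the k neighbors of a query point.

    memberships[i, c] is the share of neighbor i's vote given to class c.
    distances[i] is the distance from the query to neighbor i (optional).
    """
    memberships = np.asarray(memberships, dtype=float)
    if distances is None:
        weights = np.ones(len(memberships))
    else:
        d = np.asarray(distances, dtype=float)
        weights = 1.0 / (d ** (2.0 / (m - 1.0)) + eps)
    votes = weights @ memberships
    return votes / votes.sum()

# Three neighbors, two classes (made-up membership values).
memberships = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]
plain = fuzzy_knn_vote(memberships)
weighted = fuzzy_knn_vote(memberships, distances=[0.1, 1.0, 1.0])
print("unweighted:", np.round(plain, 3))
print("distance-weighted:", np.round(weighted, 3))
```

In this example the very close first neighbor dominates once distance weighting is switched on, which flips the predicted class.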
The proposed hubness-based
fuzziness
 Use class hubness to define fuzziness,
  if a data point exhibits enough hubness
 If not, plan B
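
The rule above might be sketched like this. The sketch assumes that fuzziness is the Laplace-smoothed relative class hubness, (N_{k,c}(x) + λ) / (Nk(x) + C·λ) for C classes, and that "enough hubness" means Nk(x) > θ. The smoothing constant `lam`, the threshold `theta`, and the crisp-label fallback used as "plan B" are illustrative assumptions.

```python
import numpy as np

def hubness_fuzziness(class_counts, fallback, theta=0, lam=1.0):
    """Fuzzy class memberships derived from class hubness.

    class_counts[x, c] is the class hubness of point x for class c.
    fallback[x] is the membership vector used when N_k(x) <= theta.
    """
    class_counts = np.asarray(class_counts, dtype=float)
    n_classes = class_counts.shape[1]
    total = class_counts.sum(axis=1, keepdims=True)  # N_k(x)
    smoothed = (class_counts + lam) / (total + n_classes * lam)
    return np.where(total > theta, smoothed, fallback)

# Toy class-hubness counts; the last point never occurs as a neighbor.
class_counts = np.array([[1, 0], [1, 2], [2, 1], [0, 0]])
y = np.array([0, 0, 1, 1])
crisp = np.eye(2)[y]  # plan B here: the point's own label as a crisp vote
u = hubness_fuzziness(class_counts, crisp, theta=0)
print(np.round(u, 3))
```

Raising `theta` moves more low-hubness points onto the fallback, trading the information in sparse occurrence counts for a more stable estimate.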
Anti-hubs: a problem?
 There exist points which never appear in k-
  neighbor sets on the training data
 However, there are even more points
  which simply appear rarely
 So, we have to be careful
 On the other hand, these points will most
  likely occur rarely on the test data as well
The low dimensional case…
The high dimensional case
Approximative approaches
 Use point label
 Use global class-to-class hubness estimate
     Captures average class-hubness among
      data points from the same category
 Use a local estimate
     We tested two different ways of fuzzifying
      the local estimates
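
The global class-to-class estimate could be sketched as follows: pool the class hubness of all points that share a label and normalize, then use the row for an anti-hub's own class as its fuzzy vote. The smoothing constant `lam` and the function name are assumptions made for this illustration, and the counts are made up.

```python
import numpy as np

def global_class_to_class(class_counts, y, n_classes, lam=1.0):
    """estimate[a, c]: share of the k-occurrences of class-a points that
    fall in neighbor sets of class-c points (Laplace-smoothed)."""
    class_counts = np.asarray(class_counts, dtype=float)
    agg = np.zeros((n_classes, n_classes))
    # Sum the class-hubness rows of all points carrying the same label.
    np.add.at(agg, y, class_counts)
    agg += lam
    return agg / agg.sum(axis=1, keepdims=True)

# Toy class-hubness counts; the last point is an anti-hub.
class_counts = np.array([[3, 0], [1, 2], [0, 4], [0, 0]])
y = np.array([0, 0, 1, 1])
estimate = global_class_to_class(class_counts, y, n_classes=2)
print(np.round(estimate, 3))
print("vote used for the anti-hub (label 1):", np.round(estimate[y[3]], 3))
```

Compared with using the point label directly, this fallback lets an anti-hub inherit the typical voting behavior of its class instead of casting a crisp vote.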
Experimental evaluation
 UCI data
    low-to-medium hubness data
    many binary classification problems
 ImageNet data
    5 multiclass classification problems
    SIFT codebook representation
    color histograms
The distance weighting
 An optional part of the algorithm, so we
  decided to see if it makes a difference
 It turns out that it does lead to slightly
  better results
 Notation:
     h-FNN – the non-weighted version
     dwh-FNN – the weighted version
Comparison between the
estimates
Neighborhood sizes
Results on ImageNet data
kNN




h-FNN
Conclusions
 Class hubness can be successfully
  exploited in a fuzzy voting scheme for k-
  nearest neighbor classification

 Anti-hubs can be treated as a separate
  case, in any of the proposed ways,
  without compromising the accuracy
Conclusions
 The phenomenon of hubness, even
  though inherently detrimental, can be
  turned to our advantage by building
  hubness-aware classification algorithms

 There is certainly a lot of space for follow-
  ups and potential improvement
Acknowledgements
   This work was supported by the bilateral
    project between Slovenia and Serbia
    “Correlating Images and Words: Enhancing
    Image Analysis Through Machine Learning
    and Semantic Technologies,” the Slovenian
    Research Agency, the Serbian Ministry of
    Education and Science through project no.
    OI174023, “Intelligent techniques and their
    integration into wide-spectrum decision
    support,” and the ICT Programme of the EC
    under PASCAL2 (ICT-NoE-216886) and
    PlanetData (ICT-NoE-257641)
Thank you for your attention
Questions?

Hubness-Based Fuzzy Measures for High-Dimensional k-Nearest Neighbor Classification

  • 1. Hubness-Based Fuzzy Measures for High-Dimensional k-Nearest Neighbor Classification Nenad Tomašev, Miloš Radovanović, Dunja Mladenić, Mirjana Ivanović
  • 2. Presentation outline • The phenomenon of hubness • Why it matters: a motivating example • Types of hubness • Exploiting hubness information in kNN: hubness-based fuzzy measures • Anti-hubs: a problem? • Approximative approaches • Experimental evaluation • Conclusions and future work
  • 3. Hubness  One consequence of the well-known dimensionality curse  Influential points emerge in nearest- neighbor methods: HUBS  These hubs appear in many k-neighbor sets
  • 4. What was that song again?  Hubness: first noticed in music collection mining  Some songs were being retrieved (as nearest neighbors) much more often than other songs  This did not, however, reflect the perceived similarity between the songs
  • 5. Hubness: definitions  Hubs: points which appear often as neighbors  Influential points  Rare among the data  Anti-hubs: points which nearly never appear as neighbors  Possible outliers  Common among the data  k-occurrence: a point occurs in a k-neighbor set  Nk(x): the number of k-occurrences of point x
  • 6. k-occurrence distribution (graphs taken from: Radovanović, Nanopoulos, Ivanović - Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data)
  • 7. k-occurrence distribution (graphs taken from: Radovanović, Nanopoulos, Ivanović - Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data)
  • 8. What causes hubness  At first it was thought that some data distributions or some metrics might be the underlying cause  The truth is much simpler than that: hubness is present in almost any inherently high-dimensional data
  • 9. High-dimensional data  Image data, video, audio, measurement streams, medical records, text, …  Modern machine learning challenges are all of an inherently high-dimensional nature
  • 10. The curse of dimensionality  Everything is sparse  The requirements for proper density estimates rise exponentially with dimensionality  The notions of structure and "shape" of clusters are much less meaningful, since there is not enough data to capture these higher-order dependencies  Concentration of distances  The relative contrast decreases  Distance expectation increases, but the variance remains constant
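The concentration of distances mentioned on slide 10 can be observed with a short simulation. This is a sketch under stated assumptions: points are drawn uniformly from the unit hypercube, the distance is Euclidean, and relative contrast is taken as (d_max - d_min) / d_min with respect to a single random query point. The function name `relative_contrast` and all parameter values are hypothetical.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(d_max - d_min) / d_min for distances from one random query point
    to n_points random points in the unit hypercube of the given dimension."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

contrasts = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, value in contrasts.items():
    print(f"dim={dim:5d}  relative contrast={value:.3f}")
```

As the dimensionality grows, the nearest and the farthest point end up at almost the same distance from the query, so the printed contrast shrinks towards zero.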
  • 11. Related work  Work by Miloš Radovanović et al.  The general properties of the hubness phenomenon  Hubness-weighted kNN  Hubness-based outlier detection  Hubs and anti-hubs in SVM  Time series classification  …  Work by Krisztian Buza et al.  Instance selection / data reduction based on hubness  Time series classification by using hubness
  • 12. Hubness-weighted kNN: the idea  Good hubness: GNk(x)  Bad hubness: BNk(x)  Total hubness: Nk(x)
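A minimal sketch of how the three quantities on slide 12 relate, assuming the usual definition: a k-occurrence of x in the neighbor set of another point is "good" when the two labels match and "bad" when they differ, so that Nk(x) = GNk(x) + BNk(x). The function name, the precomputed neighbor sets, and the labels below are hypothetical toy values.

```python
def good_bad_hubness(neighbor_sets, labels):
    """Split each point's k-occurrences into good (label match) and bad
    (label mismatch) ones. neighbor_sets[i] lists the neighbors of point i."""
    good = [0] * len(labels)
    bad = [0] * len(labels)
    for i, nbrs in enumerate(neighbor_sets):
        for j in nbrs:
            if labels[j] == labels[i]:
                good[j] += 1
            else:
                bad[j] += 1
    return good, bad

# Hypothetical 2-neighbor sets for four labeled points.
neighbor_sets = [[1, 2], [0, 2], [0, 1], [2, 0]]
labels = ["a", "a", "b", "b"]

good, bad = good_bad_hubness(neighbor_sets, labels)
total = [g + b for g, b in zip(good, bad)]
print(good, bad, total)  # [1, 1, 1, 0] [2, 1, 2, 0] [3, 2, 3, 0]
```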
  • 13. Hubness-weighted kNN: the weights  A simple, yet effective instance-specific weighting scheme  This was the second baseline (the first was kNN) in the experiments
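The transcript does not preserve the weight formula from slide 13. The sketch below assumes the form commonly attributed to hubness-weighted kNN: each neighbor's vote is scaled by exp(-h_b(x)), where h_b(x) is the standardized bad hubness (BNk(x) minus its mean, divided by its standard deviation). The function name, the use of the population standard deviation, and the toy counts are assumptions.

```python
import math
import statistics

def hubness_weights(bad_counts):
    """Assumed weighting: w(x) = exp(-(BN_k(x) - mean) / std).
    Points with above-average bad hubness get a vote weight below 1."""
    mean = statistics.fmean(bad_counts)
    std = statistics.pstdev(bad_counts)
    if std == 0:
        # No variation in bad hubness: fall back to uniform weights.
        return [1.0] * len(bad_counts)
    return [math.exp(-(b - mean) / std) for b in bad_counts]

bad_counts = [2, 1, 2, 0]  # hypothetical BN_k values
weights = hubness_weights(bad_counts)
for b, w in zip(bad_counts, weights):
    print(f"BN_k={b}  weight={w:.3f}")
```

The design intent is simply that the votes of bad hubs are damped and the votes of points with little bad hubness are amplified.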
  • 14. How bad can bad hubness become?
  • 16. Hubness-based fuzzy measures  The idea: explore the structure of bad hubness  Introducing the concept of class hubness  In other words:  There is nothing inherently good or bad about a k-occurrence; as a random event, it simply carries some information about the label of the point of interest
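Class hubness, as introduced on slide 16, can be sketched as a per-class breakdown of the k-occurrence count: for every point x and class c, count how often x occurs in the neighbor sets of points labeled c. The function name `class_hubness`, the data layout, and the toy neighbor sets are assumptions made for illustration.

```python
def class_hubness(neighbor_sets, labels, classes):
    """N_{k,c}(x): number of k-occurrences of x in neighbor sets of
    points whose label is c. neighbor_sets[i] lists the neighbors of point i."""
    counts = [{c: 0 for c in classes} for _ in labels]
    for i, nbrs in enumerate(neighbor_sets):
        for j in nbrs:
            counts[j][labels[i]] += 1
    return counts

# Hypothetical 2-neighbor sets for four labeled points.
neighbor_sets = [[1, 2], [0, 2], [0, 1], [2, 0]]
labels = ["a", "a", "b", "b"]
classes = ["a", "b"]

class_counts = class_hubness(neighbor_sets, labels, classes)
for i, c in enumerate(class_counts):
    print(i, labels[i], c)
```

Point 0 (label "a") occurs once among class "a" points and twice among class "b" points, so its occurrences say more about class "b" than about its own label. Point 3 never occurs at all and carries no such information.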
  • 17. The fuzzy k-nearest neighbor framework  Each neighbor distributes its vote across all the categories  Also, some distance-based weighting
  • 18. The proposed hubness-based fuzziness  Use class hubness to define fuzziness, if a data point exhibits enough hubness  If not, plan B
  • 19. Anti-hubs: a problem?  There exist points which never appear in k-neighbor sets on the training data  However, there are even more points which simply appear rarely  So, we have to be careful  On the other hand, these points will most likely occur rarely on the test data as well
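The voting rule on slide 18 might be sketched as follows. Several details are assumptions, since the slide text does not state them: the threshold is called `theta`, the class-hubness counts are smoothed with a Laplace-style constant `lam`, and "plan B" is taken to be a crisp vote for the neighbor's own label (one of several possible fallbacks). All names and toy values are hypothetical.

```python
def fuzzy_votes(class_counts, label, classes, theta=0, lam=1.0):
    """Fuzzy vote of one neighbor. With enough hubness (N_k > theta),
    use smoothed class hubness; otherwise fall back to the crisp label."""
    total = sum(class_counts.values())  # N_k(x)
    if total > theta:
        denom = len(classes) * lam + total
        return {c: (lam + class_counts[c]) / denom for c in classes}
    return {c: 1.0 if c == label else 0.0 for c in classes}

def classify(neighbor_ids, class_counts_all, labels, classes, theta=0, lam=1.0):
    """Sum the fuzzy votes of the query's neighbors and pick the best class."""
    scores = {c: 0.0 for c in classes}
    for j in neighbor_ids:
        votes = fuzzy_votes(class_counts_all[j], labels[j], classes, theta, lam)
        for c in classes:
            scores[c] += votes[c]
    return max(scores, key=scores.get), scores

classes = ["a", "b"]
labels = ["a", "a", "b", "b"]
# Hypothetical class-hubness counts N_{k,c}(x) for four training points.
class_counts_all = [
    {"a": 1, "b": 2},
    {"a": 1, "b": 1},
    {"a": 2, "b": 1},
    {"a": 0, "b": 0},  # anti-hub: never occurs as a neighbor
]

# A query whose two nearest neighbors are training points 0 and 3.
predicted, scores = classify([0, 3], class_counts_all, labels, classes)
print(predicted, scores)
```

Point 0 is labeled "a" but votes mostly for "b" because of its occurrence profile, while the anti-hub 3 falls back to its own label.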
  • 22. Approximative approaches  Use point label  Use global class-to-class hubness estimate  Captures average class-hubness among data points from the same category  Use a local estimate  We tested two different ways of fuzzifying the local estimates
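One possible reading of the global class-to-class estimate on slide 22: pool the class-hubness counts of all training points that share a label, and use the normalized pooled profile as the vote of any anti-hub with that label. The sketch below follows this reading. The function name, the smoothing constant `lam`, and the toy counts are assumptions, not the authors' exact definition.

```python
def global_class_to_class(class_counts_all, labels, classes, lam=1.0):
    """For each own-class label, average the class-hubness profile over all
    points carrying that label (with Laplace-style smoothing)."""
    estimate = {}
    for own in classes:
        pooled = {c: 0 for c in classes}
        for counts, label in zip(class_counts_all, labels):
            if label == own:
                for c in classes:
                    pooled[c] += counts[c]
        denom = len(classes) * lam + sum(pooled.values())
        estimate[own] = {c: (lam + pooled[c]) / denom for c in classes}
    return estimate

classes = ["a", "b"]
labels = ["a", "a", "b", "b"]
# Hypothetical class-hubness counts N_{k,c}(x) for four training points.
class_counts_all = [
    {"a": 1, "b": 2},
    {"a": 1, "b": 1},
    {"a": 2, "b": 1},
    {"a": 0, "b": 0},
]

estimate = global_class_to_class(class_counts_all, labels, classes)
print(estimate["a"])  # vote profile used for anti-hubs labeled "a"
print(estimate["b"])  # vote profile used for anti-hubs labeled "b"
```

Compared with the crisp "use point label" fallback, this estimate lets an anti-hub inherit the typical behavior of its class, at the cost of ignoring anything local to the point.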
  • 23. Experimental evaluation  UCI data  low-to-medium hubness data  many binary classification problems  ImageNet data  5 multiclass classification problems  SIFT codebook representation  color histograms
  • 24. The distance weighting  An optional part of the algorithm, so we decided to see if it makes a difference  It turns out that it does lead to slightly better results  Notation:  h-FNN – the non-weighted version  dwh-FNN – the weighted version
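The slide does not state which distance weighting is used. The sketch below assumes the inverse-distance weighting of the standard fuzzy kNN framework, where each neighbor's fuzzy vote is scaled by 1 / d^(2/(m-1)) and the result is normalized. The function name, the parameter `m`, the guard `eps`, and the toy votes are hypothetical.

```python
def distance_weighted_scores(neighbor_votes, distances, m=2.0, eps=1e-12):
    """Combine fuzzy votes with weights 1 / d^(2/(m-1)), normalized to sum to 1.
    eps guards against division by zero for exact duplicates of the query."""
    exponent = 2.0 / (m - 1.0)
    weights = [1.0 / max(d, eps) ** exponent for d in distances]
    norm = sum(weights)
    classes = list(neighbor_votes[0])
    return {
        c: sum(w * votes[c] for w, votes in zip(weights, neighbor_votes)) / norm
        for c in classes
    }

# Hypothetical fuzzy votes of two neighbors and their distances to the query.
neighbor_votes = [{"a": 0.4, "b": 0.6}, {"a": 0.0, "b": 1.0}]
near_first = distance_weighted_scores(neighbor_votes, distances=[1.0, 2.0])
near_second = distance_weighted_scores(neighbor_votes, distances=[2.0, 1.0])
print(near_first)
print(near_second)
```

With uniform weights this would correspond to the non-weighted h-FNN variant. With the weighting, the closer neighbor dominates the final score, which is the behavior the dwh-FNN notation refers to.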
  • 30. Conclusions  Class hubness can be successfully exploited in a fuzzy voting scheme for k-nearest neighbor classification  Anti-hubs can be treated as a separate case, in any of the proposed ways, without compromising the accuracy
  • 31. Conclusions  The phenomenon of hubness, even though inherently detrimental, can be turned to our advantage by building hubness-aware classification algorithms  There is certainly a lot of space for follow-ups and potential improvement
  • 32. Acknowledgements  This work was supported by the bilateral project between Slovenia and Serbia “Correlating Images and Words: Enhancing Image Analysis Through Machine Learning and Semantic Technologies,” the Slovenian Research Agency, the Serbian Ministry of Education and Science through project no. OI174023, “Intelligent techniques and their integration into wide-spectrum decision support,” and the ICT Programme of the EC under PASCAL2 (ICT-NoE-216886) and PlanetData (ICT-NoE-257641)
  • 33. Thank you for your attention Questions?