Identifying news clusters using Q-analysis and Modularity
David Rodrigues+
Centre for Complexity and Design
+The Open University, UK – david.rodrigues@open.ac.uk
  
Complexity & Design Workshop at ECCS13
complexityanddesign.com
Thursday am – Room S11
  
Motivation
•  Find structure in collections of text documents
•  Create computer algorithms to automate this discovery with minimal human supervision.
•  Use of hybrid methodologies to improve quality of results
   –  Topology based approach describes data
   –  Clustering technique to identify modules
  
Problem Description
•  Identify the structure of the news published online by The Guardian (among other newspapers)
   –  Clustering?
   –  Topology?
   –  Topic Modelling?
   –  Noise?
   –  Novelty?
   –  Change?
[Kohut, A. and Remez, M. (2008)]
  
Clustering Techniques in Topic Modelling
•  Nearest neighbour classification
•  Bayesian probabilistic techniques
•  Decision trees
•  Regression models
•  Neural networks
•  Support vector machines
•  Language dependent / human intervention in the definition of categories for training samples.
  
Clustering in Graphs is Community Detection
•  Modularity based techniques [majority]
•  Spectral algorithms
•  Synchronization based techniques
•  …
•  [Community detection in graphs - Fortunato, 2010, for comprehensive review]
•  Binary relations between nodes don’t capture the multi-level structure of existing relations.
   –  Move to n-ary relations and descriptions
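
For reference, the quantity that modularity based techniques optimise can be written in a few lines. This sketch is not taken from the slides: it assumes an undirected, unweighted graph given as an edge list and uses the standard Newman-Girvan definition, Q = Σ_c [ L_c/m − (d_c/2m)² ], where m is the number of edges, L_c the number of edges inside community c and d_c the sum of degrees in c. The function name `modularity` and the toy graph are illustrative only.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman-Girvan modularity of a partition of an undirected graph.

    edges        -- iterable of (u, v) pairs, each edge listed once
    community_of -- dict mapping every node to a community label
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)  # L_c: edges with both ends in community c
    degree = defaultdict(int)    # d_c: sum of node degrees in community c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph (hypothetical): two triangles joined by a single bridge edge.
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
toy_partition = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
toy_q = modularity(toy_edges, toy_partition)
```

Putting every node in one community always gives Q = 0, which is why a positive Q signals more internal edges than expected by chance.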
  
Previously
•  We used a sliding window over the time series of the news stories
•  Used Variation of Information to measure changes in an evolving adaptive network of news [Meilă 2007, Rodrigues 2010]
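
Variation of Information compares two partitions of the same items: VI(A, B) = H(A) + H(B) − 2 I(A; B). A minimal sketch follows, assuming both partitions label the same set of stories in the same order. How stories were aligned across windows in the earlier work is not stated here, and the function name is illustrative.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """VI(A, B) = H(A) + H(B) - 2 I(A; B), in nats.

    labels_a, labels_b -- equal-length sequences; position i holds the
    cluster label of item i in each of the two partitions.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must cover the same items")
    n = len(labels_a)
    if n == 0:
        return 0.0
    size_a = Counter(labels_a)
    size_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        # Each overlap contributes -p_ab * [log(p_ab/p_a) + log(p_ab/p_b)].
        vi -= p_ab * (log(p_ab / (size_a[a] / n)) + log(p_ab / (size_b[b] / n)))
    return vi
```

Identical partitions give 0; the value grows as the two clusterings disagree, which makes it usable as a change signal between consecutive windows.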
  
Our Proposal
•  Use a high dimensional representation of the documents (Simplicial Complex)
•  Use Q-analysis to describe the system constructed from the Documents x Tags Incidence Matrix
•  Use Q-connected components to filter noise.
•  Use modularity optimisation to find communities in the resulting induced graphs
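
The Q-analysis step can be sketched as follows, under the usual textbook definitions rather than anything specific to the authors' code: each document is a simplex whose vertices are its tags, two documents are q-near when they share at least q + 1 tags, and q-connected components are the transitive closure of q-nearness. The names `q_components` and `structure_vector` and the toy stories are hypothetical.

```python
from itertools import combinations


def q_components(doc_tags, q):
    """Group documents into q-connected components.

    doc_tags -- dict: document id -> set of tags (the simplex's vertices)
    q        -- connectivity level; two documents are q-near when they
                share at least q + 1 tags
    Only documents with at least q + 1 tags (dimension >= q) take part.
    """
    docs = sorted(d for d, tags in doc_tags.items() if len(tags) >= q + 1)
    parent = {d: d for d in docs}

    def find(d):
        # Union-find root lookup with path halving.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for a, b in combinations(docs, 2):
        if len(doc_tags[a] & doc_tags[b]) >= q + 1:
            parent[find(a)] = find(b)

    groups = {}
    for d in docs:
        groups.setdefault(find(d), set()).add(d)
    return sorted(groups.values(), key=lambda g: sorted(g))


def structure_vector(doc_tags):
    """Number of q-connected components for q = 0 .. max dimension."""
    top = max((len(t) for t in doc_tags.values()), default=0) - 1
    return [len(q_components(doc_tags, q)) for q in range(top + 1)]


# Hypothetical stories and tags, for illustration only.
toy = {
    "budget": {"politics", "economy", "tax"},
    "election": {"politics", "economy", "polls"},
    "markets": {"economy", "tax", "banks"},
    "football": {"sport"},
}
```

The pairwise comparison is quadratic in the number of documents, which is fine for a sketch but would need an inverted index (tag to documents) for a real corpus.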
  
Noise?
•  In the news context, we define noise news as news that is only loosely related to the main topics published.
•  We can filter it out by assuming that the Q-connectedness of this news is very low.
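
One way to turn this assumption into a filter, sketched under stated assumptions: for every story, find the largest number of tags it shares with any other story, and treat the story as noise when that overlap falls below q_min + 1. The threshold `q_min`, the function name and the example stories are illustrative, not the authors' settings.

```python
def split_noise(doc_tags, q_min):
    """Split documents into (kept, noise) at connectivity level q_min.

    A document is kept when it shares at least q_min + 1 tags with some
    other document; otherwise it is treated as noise.
    """
    kept, noise = [], []
    for doc, tags in doc_tags.items():
        best_overlap = max(
            (len(tags & other_tags)
             for other, other_tags in doc_tags.items() if other != doc),
            default=0,
        )
        # best_overlap - 1 is the highest q at which doc is q-near another.
        if best_overlap - 1 >= q_min:
            kept.append(doc)
        else:
            noise.append(doc)
    return kept, noise


# Hypothetical stories, for illustration only.
stories = {
    "n1": {"uk", "politics", "budget"},
    "n2": {"uk", "politics", "nhs"},
    "n3": {"uk", "weather"},
    "n4": {"chess"},
}
kept, noise = split_noise(stories, q_min=1)
```

Raising `q_min` keeps only tightly related stories, at the cost of discarding more of the corpus.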
  
The Guardian
•  Classifies news with useful metadata:
   –  …
   –  Section
   –  Tags
   –  …
http://www.theguardian.com/open-platform
Open Platform with API for application development.
3 years of data: 2010, 2011 and 2012
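
A hedged sketch of how one month of stories with their tags might be requested. The endpoint and the parameter names (`api-key`, `from-date`, `to-date`, `show-tags`, `page`, `page-size`) are assumptions based on the public Guardian Content API and should be checked against the Open Platform documentation; the slides do not show the actual requests used. The code only builds the URL and makes no network call.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Guardian Content API; verify in the official docs.
SEARCH_ENDPOINT = "https://content.guardianapis.com/search"


def build_search_url(api_key, from_date, to_date, page=1, page_size=50):
    """Build a search URL asking for articles together with their tags."""
    params = {
        "api-key": api_key,
        "from-date": from_date,   # ISO date, e.g. "2011-11-01"
        "to-date": to_date,
        "show-tags": "keyword",   # assumed switch to include tag metadata
        "page": page,
        "page-size": page_size,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


# "YOUR-API-KEY" is a placeholder, not a working key.
url = build_search_url("YOUR-API-KEY", "2011-11-01", "2011-11-30")
```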
  
Pseudo code for the automated news clustering and filtering algorithm
[Slides 11 and 12: pseudo code listing, not preserved in this text]
  
Incidence Matrix (Documents x Tags)

          TAG 1   TAG 2   TAG 3   TAG 4   TAG 5   …
NEWS 1      1       1       0       0       0     …
NEWS 2      0       1       1       0       1     …
NEWS 3      0       1       0       0       1     …
NEWS 4      1       0       0       0       1     …
NEWS 5      0       0       0       1       1     …
…           …       …       …       …       …     …
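
To make the link between this matrix and Q-analysis concrete, the sketch below multiplies the incidence matrix by its transpose and subtracts 1, the standard shared-face construction. It uses only the five rows and columns visible in the table above; the elided "…" entries are left out, and the function name is illustrative.

```python
# Rows are NEWS 1..5, columns are TAG 1..5, copied from the table above
# (the elided "…" rows and columns are left out).
incidence = [
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
]


def shared_face_matrix(matrix):
    """Return S with S[i][j] = (number of tags shared by rows i and j) - 1.

    S[i][i] is the dimension of document i as a simplex; S[i][j] is the
    largest q at which documents i and j are q-near (-1 means no shared tag).
    """
    n = len(matrix)
    return [
        [sum(a * b for a, b in zip(matrix[i], matrix[j])) - 1 for j in range(n)]
        for i in range(n)
    ]


shared = shared_face_matrix(incidence)
```

In this sample, NEWS 2 and NEWS 3 share TAG 2 and TAG 5, so they are 1-near, while NEWS 1 and NEWS 5 share no tag at all.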
  
Results
  
Community detection on the 0-connected graph
1 Month of News – November 2011
Modularity = 0.48
9 communities
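
The slides do not say which modularity optimisation algorithm produced these communities. As a stand-in, here is a naive greedy agglomerative scheme: start with one community per node and keep merging the connected pair with the largest modularity gain. It assumes the graph links two stories whenever they share at least one tag, and it is meant to show the idea, not to reproduce the reported figures.

```python
from collections import defaultdict


def greedy_modularity(edges):
    """Agglomerative modularity optimisation on an undirected graph.

    Starts with one community per node and repeatedly merges the pair of
    connected communities with the largest modularity gain, stopping when
    no merge improves modularity. Returns a list of sets of nodes.
    """
    edges = list(edges)
    two_m = 2 * len(edges)
    nodes = sorted({n for e in edges for n in e})
    members = {n: {n} for n in nodes}   # community id -> nodes
    degree = defaultdict(int)           # community id -> degree sum
    between = defaultdict(int)          # (id_a, id_b) -> edge count
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if u != v:
            between[(min(u, v), max(u, v))] += 1

    while True:
        best_gain, best_pair = 0.0, None
        for (a, b), e_ab in sorted(between.items()):
            # Gain from merging a and b: 2 * (e_ab/2m - d_a*d_b/(2m)^2).
            gain = 2 * (e_ab / two_m - degree[a] * degree[b] / two_m ** 2)
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:
            break
        a, b = best_pair
        members[a] |= members.pop(b)
        degree[a] += degree.pop(b)
        # Re-point b's inter-community edges at a.
        merged = defaultdict(int)
        for (x, y), count in between.items():
            x, y = (a if x == b else x), (a if y == b else y)
            if x != y:
                merged[(min(x, y), max(x, y))] += count
        between = merged
    return sorted(members.values(), key=min)


# Toy graph (hypothetical): two triangles joined by a single bridge edge.
toy_graph = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
toy_communities = greedy_modularity(toy_graph)
```

Each pass rescans every pair of connected communities, so this is only suitable for small graphs; production code would normally rely on an established library implementation.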
  
Small fraction of vertices is highly connected

Giant component only for low connected graph

Modularity vs. connectedness

Number of nodes decreases quickly with Q
  
Number of nodes and Edge Density
November 2011
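
The plot itself is not preserved, but the two quantities can be measured as sketched below. The construction is an assumption: at level q an edge joins two documents that share at least q + 1 tags, a node is any document with at least one such edge, and density is 2E / (N(N − 1)). The function name and sample data are hypothetical.

```python
from itertools import combinations


def q_graph_stats(doc_tags, q):
    """Return (number_of_nodes, edge_density) of the q-graph.

    Assumed construction: an edge joins two documents that share at least
    q + 1 tags, and a node is any document with at least one such edge.
    """
    edges = [
        (a, b)
        for a, b in combinations(sorted(doc_tags), 2)
        if len(doc_tags[a] & doc_tags[b]) >= q + 1
    ]
    nodes = {n for e in edges for n in e}
    n = len(nodes)
    density = 2 * len(edges) / (n * (n - 1)) if n > 1 else 0.0
    return n, density


# Hypothetical documents, for illustration only.
sample = {
    "a": {"t1", "t2", "t3"},
    "b": {"t2", "t3", "t4"},
    "c": {"t3", "t4"},
    "d": {"t4"},
    "e": {"t9"},
}
stats_by_q = [q_graph_stats(sample, q) for q in range(3)]
```

On this sample the node count drops from 4 to 3 to 0 as q goes from 0 to 2, a small-scale version of the fast decrease with Q described on the slides.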
  
Average Clustering and Degree Assortativity

n. Components and Modularity

Q=5 + Modularity

Examples Of Clusters (I)

Examples Of Clusters (II)
  
Developed Tools
•  Theseus – A Python application for collecting, processing and visualising the textual dataset - https://github.com/sixhat/theseus
•  Visualisation tool
  
Visualisation Tool
  
Conclusions
•  Q-analysis gives a descriptive overview of the structure of the system, in terms of the local connectivity of the news stories.
•  Clustering (on top of the Q-analysis) gives a natural (highly modular) division of the resulting structures.
•  This allows the identification of coherent news clusters and the filtering of noise news.
  
Generalisation of applicability
•  Instead of human-tagged documents, one can apply this to any kind of text-based documents:
   –  HTML webpages: use the keywords tag from the header, or extract keywords with topic modelling (LDA, for example)
   –  Scientific documents: tag documents with topic modelling strategies like LDA and, instead of treating low connected stories as noise, explore the possibility that they might be emerging scientific trends.
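
For the HTML case, the header keywords can stand in for human-assigned tags. A minimal sketch using only the Python standard library follows; it assumes pages carry a `<meta name="keywords" content="...">` element with comma-separated values, and the class and function names are illustrative.

```python
from html.parser import HTMLParser


class KeywordsParser(HTMLParser):
    """Collect the content of <meta name="keywords" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "keywords":
            for word in (attr.get("content") or "").split(","):
                word = word.strip().lower()
                if word:
                    self.keywords.add(word)


def page_tags(html_text):
    """Return the set of header keywords for one HTML page."""
    parser = KeywordsParser()
    parser.feed(html_text)
    return parser.keywords


# Hypothetical page, for illustration only.
page = (
    '<html><head><meta name="keywords" content="Politics, Economy , tax">'
    "</head><body></body></html>"
)
tags = page_tags(page)
```

The resulting sets play the role of the rows of the Documents x Tags incidence matrix, one set per page.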
  
Take home message
•  Real complex systems are multi-dimensional. Community detection methods need to take those descriptions into account.
•  The construction of descriptions with all the relations (hyper-simplices) gives qualitatively better results.
•  In the newspapers case, this helps the filtering of "noise" news (unrelated news).
  

