Mining data from a large graph with a kernel Kohonen map
1. Context and motivations
Kohonen maps
Heat kernel
Results
Mining data from a large graph
with a kernel Kohonen map
Nathalie Villa-Vialaneix
In collaboration with Fabrice Rossi, Romain Boulet & Bertrand Jouve
Institut de Mathématiques de Toulouse, France -
nathalie.villa@math.univ-toulouse.fr
BIA Seminar, Toulouse, 13 March 2008
Nathalie Villa BIA Seminar - 13 March 2008
2. Outline
1 Context and motivations
2 Kohonen maps
3 Heat kernel
4 Results
4. Contexte et motivations
Cartes de Kohonen
Noyau de la chaleur
Résultats
Explorer une grosse base de données historique
Data
1000 contrats agraires,
de 4 seigneuries (environ 10 villages) du Lot,
établis entre 1250 et 1350 (avant la guerre de cent ans).
Nathalie Villa Séminaire BIA - 13 mars 2008
5. Contexte et motivations
Cartes de Kohonen
Noyau de la chaleur
Résultats
Explorer une grosse base de données historique
Data
1000 contrats agraires,
de 4 seigneuries (environ 10 villages) du Lot,
établis entre 1250 et 1350 (avant la guerre de cent ans).
Questions des historiens :
les liens sociaux sont-ils familiaux ? géographiques ?
peut-on trouver des personnalités ayant un rôle social
prépondérant ? des familles ?
. . .
Nathalie Villa Séminaire BIA - 13 mars 2008
6. Exploring a large historical database
Data
1000 agrarian contracts,
from 4 seigneuries (about 10 villages) in the Lot,
drawn up between 1250 and 1350 (before the Hundred Years' War).
Historians' questions:
are social ties based on family? on geography?
can we identify individuals with a predominant social role? families?
...
⇒ Data mining is needed.
10. A problem modeled by a graph
From the database, build a weighted graph:
with 615 vertices $x_1, \ldots, x_n$ := the peasants named in the contracts;
with weights $(w_{i,j})_{i,j=1,\ldots,n}$, where $w_{i,j} := |\{\text{contracts in which } x_i \text{ and } x_j \text{ are cited together}\}|$.
Number of vertices: 615
Number of edges: 4193
Total weight: 40,329
Diameter: 10
Density: 2.2%
Cluster the vertices into homogeneous social groups in order to understand the overall structure of the peasant community.
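As a quick arithmetic check of the reported density, the figure can be recomputed from the vertex and edge counts. This is only a sketch: it assumes the graph is simple and undirected, so that density is the number of edges divided by the number of vertex pairs.

```python
# Counts taken from the slide; the density formula assumes a simple undirected graph.
n_vertices = 615
n_edges = 4193

max_edges = n_vertices * (n_vertices - 1) // 2   # number of distinct vertex pairs
density = n_edges / max_edges

print(f"{max_edges} possible edges, density = {density:.1%}")
```

The result rounds to the 2.2% given on the slide.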
13. A twofold objective: clustering and organization
Cluster the vertices into proximity groups... and organize the groups.
15. General principle of the Kohonen algorithm [Kohonen, 2001]
Let $(x_i)_{i=1,\ldots,n} \in \mathcal{H}$ be the data (high-dimensional vector space, graph, ...).
16. General principle of the Kohonen algorithm [Kohonen, 2001]
Each $x_i$ is assigned to a neuron (a cluster) of the map, $f(x_i)$.
The neurons are positioned relative to one another by a neighborhood relation ("distance": $d$).
17. General principle of the Kohonen algorithm [Kohonen, 2001]
[Figure: neurons 1, 2, 3 of the map with their prototypes $p_1$, $p_2$, $p_3$]
Each neuron $j$ of the map is represented by a prototype $p_j$.
The pairs $(j, p_j)$ and $(x_i, f(x_i))$ depend on each other and are updated iteratively.
19. Preserving the topology of the data in $\mathcal{H}$
Energy
The goal is to minimize the energy of the map:
$$E = \sum_{i=1}^{M} \int h(d(f(x), i)) \, \|x - p_i\|_{\mathcal{H}}^2 \, dP(x)$$
where $h$ is a decreasing function (e.g. $h(t) = \alpha e^{-t/(2\sigma^2)}$).
The energy is approximated by its empirical version:
$$E^n = \sum_{j=1}^{n} \sum_{i=1}^{M} h(d(f(x_j), i)) \, \|x_j - p_i\|_{\mathcal{H}}^2$$
and the minimization is approximated by the SOM algorithm.
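The empirical energy can be evaluated directly once the data, the prototypes and the assignment are stored as arrays. The sketch below assumes Euclidean data and the example neighborhood function $h(t) = \alpha e^{-t/(2\sigma^2)}$; the function name `empirical_energy` and the argument `grid_dist` (the matrix of map distances $d$ between neurons) are illustrative choices, not part of the slides.

```python
import numpy as np

def empirical_energy(X, prototypes, assignment, grid_dist, alpha=1.0, sigma=1.0):
    """E^n = sum_j sum_i h(d(f(x_j), i)) * ||x_j - p_i||^2, with h(t) = alpha * exp(-t / (2 sigma^2)).

    X: (n, p) data, prototypes: (M, p), assignment: (n,) neuron index f(x_j),
    grid_dist: (M, M) distances d between neurons on the map.
    """
    h = alpha * np.exp(-grid_dist / (2 * sigma ** 2))                    # (M, M)
    sq = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)     # (n, M)
    # h[assignment] has shape (n, M): row j holds h(d(f(x_j), i)) for every neuron i.
    return float((h[assignment] * sq).sum())

# Hypothetical two-point, two-neuron example.
X = np.array([[0.0], [2.0]])
prototypes = np.array([[0.0], [2.0]])
assignment = np.array([0, 1])
grid_dist = np.array([[0.0, 1.0], [1.0, 0.0]])
energy = empirical_energy(X, prototypes, assignment, grid_dist)
```

In this toy case each point coincides with its own prototype, so the energy comes only from the neighboring neuron, weighted by $h(1)$.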
23. Batch SOM
Randomly initialize $\gamma^0_{ji} \in \mathbb{R}$ ($i = 1, \ldots, n$; $j = 1, \ldots, M$) and set $p^0_j = \sum_{i=1}^{n} \gamma^0_{ji} x_i$. Then, for $l = 1, \ldots, n$, repeat:
Assignment step
for every $x_i$,
$$f^l(x_i) = \arg\min_{j=1,\ldots,M} \Big\| x_i - \sum_{i'=1}^{n} \gamma^l_{ji'} x_{i'} \Big\|_{\mathcal{H}}$$
Representation step
$$\gamma^l_j = \arg\min_{\gamma \in \mathbb{R}^n} \sum_{i=1}^{n} h(f^l(x_i), j) \, \Big\| x_i - \sum_{l'=1}^{n} \gamma_{l'} x_{l'} \Big\|_{\mathcal{H}}^2$$
Problem: which "distance" should be defined between two vertices?
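The two alternating steps can be sketched for ordinary Euclidean data as follows. This is a minimal illustration, not the authors' implementation: the function name `batch_som`, the Gaussian neighborhood on squared grid distances, the fixed number of iterations and the seed are all assumptions. The representation step uses the closed-form minimizer, in which the weights are proportional to $h(f(x_i), j)$.

```python
import numpy as np

def batch_som(X, grid, n_iter=20, sigma=1.0, seed=0):
    """X: (n, p) data; grid: (M, q) coordinates of the neurons on the map."""
    rng = np.random.default_rng(seed)
    n, M = X.shape[0], grid.shape[0]
    # Random convex weights gamma_ji, so that each prototype is p_j = sum_i gamma_ji x_i.
    gamma = rng.random((M, n))
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Neighborhood h between neurons (assumed Gaussian in the squared grid distance).
    d = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    H = np.exp(-d / (2 * sigma ** 2))
    for _ in range(n_iter):
        P = gamma @ X
        # Assignment step: each x_i goes to the neuron with the nearest prototype.
        dist = ((X[:, None, :] - P[None, :, :]) ** 2).sum(axis=2)
        f = dist.argmin(axis=1)
        # Representation step: weights proportional to h(f(x_i), j), rows normalized to 1.
        gamma = H[:, f]
        gamma = gamma / gamma.sum(axis=1, keepdims=True)
    # Returns the last assignment and the prototypes updated from it.
    return f, gamma @ X

# Hypothetical toy data and a 2-neuron, one-dimensional map.
X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
grid = np.array([[0.0], [1.0]])
f, P = batch_som(X, grid)
```

Because the weights of each prototype are non-negative and sum to 1, every prototype stays inside the convex hull of the data.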
27. Usual dissimilarities between vertices
The Dice (Jaccard) index:
$$D(x_i, x_j) = \frac{|\Gamma(x_i) \cap \Gamma(x_j)|}{|\Gamma(x_i)| + |\Gamma(x_j)|}$$
(unweighted graphs), where $\Gamma(x)$ denotes the set of neighbors of $x$;
Dissimilarities based on shortest paths;
Dissimilarities or distances based on the Laplacian: "spectral clustering".
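The index is easy to compute when the graph is stored as neighbor sets. The sketch below follows the formula as written on the slide (the textbook Dice coefficient carries an extra factor 2 in the numerator); the function name `dice_index` and the small graph are hypothetical.

```python
def dice_index(adjacency, a, b):
    """|Gamma(a) & Gamma(b)| / (|Gamma(a)| + |Gamma(b)|) for an unweighted graph."""
    na, nb = adjacency[a], adjacency[b]
    total = len(na) + len(nb)
    return len(na & nb) / total if total else 0.0

# Hypothetical unweighted, undirected graph stored as neighbor sets.
adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

d_ab = dice_index(adjacency, "a", "b")   # shared neighbor: c
d_bc = dice_index(adjacency, "b", "c")   # shared neighbor: a
```

Note that this quantity grows with the number of shared neighbors, so it behaves as a similarity; a dissimilarity is obtained by any decreasing transformation of it.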
28. Laplacian [Kondor and Lafferty, 2002]
Definitions
For a graph with vertices $V = \{x_1, \ldots, x_n\}$ and positive weights $(w_{i,j})_{i,j=1,\ldots,n}$ such that, for all $i, j = 1, \ldots, n$, $w_{i,j} = w_{j,i}$, and with $d_i = \sum_{j=1}^{n} w_{i,j}$,
Laplacian: $L = (L_{i,j})_{i,j=1,\ldots,n}$ where
$$L_{i,j} = \begin{cases} -w_{i,j} & \text{if } i \neq j \\ d_i & \text{if } i = j \end{cases}$$
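In matrix form the definition reads $L = D - W$, with $D$ the diagonal matrix of the degrees $d_i$. A minimal sketch, assuming a symmetric weight matrix without self-loops (zero diagonal); the function name `graph_laplacian` and the 3-vertex example are illustrative.

```python
import numpy as np

def graph_laplacian(W):
    """Return L = D - W for a symmetric weight matrix W with zero diagonal."""
    W = np.asarray(W, dtype=float)
    if not np.allclose(W, W.T):
        raise ValueError("the weight matrix must be symmetric")
    degrees = W.sum(axis=1)            # d_i = sum_j w_ij
    return np.diag(degrees) - W

# Hypothetical weighted graph on 3 vertices.
W = np.array([[0, 2, 1],
              [2, 0, 0],
              [1, 0, 0]])
L = graph_laplacian(W)
```

Each row of $L$ sums to zero, which is why the constant vector always belongs to its kernel.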
29. Properties of the Laplacian I [von Luxburg, 2007]
Connected components
$\operatorname{Ker} L = \operatorname{Span}\{\mathbf{1}_{A_1}, \ldots, \mathbf{1}_{A_k}\}$, where $A_i$ gives the positions of the vertices of the $i$-th connected component of the graph.
[Figure: graph on vertices 1 to 5 with two connected components, {1, 4, 5} and {2, 3}]
$$\operatorname{Ker} L = \operatorname{Span}\left\{ (1, 0, 0, 1, 1)^T ;\ (0, 1, 1, 0, 0)^T \right\}$$
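This property can be checked numerically on the 5-vertex example. The edges inside each component are not recoverable from the transcript, so the edge list below is a hypothetical choice that merely respects the two components {1, 4, 5} and {2, 3}.

```python
import numpy as np

# Hypothetical edges (1-based vertex labels) consistent with the two components.
edges = [(1, 4), (4, 5), (2, 3)]
n = 5
W = np.zeros((n, n))
for a, b in edges:
    W[a - 1, b - 1] = W[b - 1, a - 1] = 1.0
L = np.diag(W.sum(axis=1)) - W

# The multiplicity of the eigenvalue 0 equals the number of connected components.
eigenvalues = np.linalg.eigvalsh(L)
n_components = int(np.sum(eigenvalues < 1e-10))

ind_a = np.array([1.0, 0.0, 0.0, 1.0, 1.0])   # indicator vector of {1, 4, 5}
ind_b = np.array([0.0, 1.0, 1.0, 0.0, 0.0])   # indicator vector of {2, 3}
```

Both indicator vectors are mapped to zero by $L$, and no other independent direction is.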
30. Properties of the Laplacian II [Boulet et al., 2008]
Perfect community: a complete subgraph (clique) whose vertices all have the same neighbors outside the clique.
Laplacian and perfect communities
For an unweighted graph,
the graph has a perfect community with $m$ vertices
⇔
$L$ has $m$ eigenvectors that share the same $n - m$ zero coordinates.
32. Properties of the Laplacian II [Boulet et al., 2008]
Application:
Limitation: only 1/3 of the vertices of the graph can be represented in this way.
34. Properties of the Laplacian III [von Luxburg, 2007]
Optimal cut problem: assume that the graph is connected.
Finding a partition $A_1, \ldots, A_k$ of the vertices of the graph such that
$$\frac{1}{2} \sum_{i=1}^{k} \sum_{j \in A_i,\, j' \notin A_i} w_{j,j'}$$
is minimal is equivalent to
$$H = \arg\min_{h \in \mathbb{R}^{n \times k}} \operatorname{Tr}(h^T L h) \quad \text{subject to} \quad h^T h = I, \quad h_i = \frac{1}{\sqrt{|A_i|}} \mathbf{1}_{A_i}$$
⇒ NP-complete problem.
36. Properties of the Laplacian III [von Luxburg, 2007]
Optimal cut problem: assume that the graph is connected.
Finding a partition $A_1, \ldots, A_k$ of the vertices of the graph such that
$$\frac{1}{2} \sum_{i=1}^{k} \sum_{j \in A_i,\, j' \notin A_i} w_{j,j'}$$
is minimal can be approximated by
$$H = \arg\min_{h \in \mathbb{R}^{n \times k}} \operatorname{Tr}(h^T L h) \quad \text{subject to} \quad h^T h = I$$
Spectral clustering: compute the eigenvectors associated with the $k$ smallest eigenvalues of $L$, stack them as the columns of $H \in \mathbb{R}^{n \times k}$, and cluster the rows of $H$ (one row per vertex).
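The relaxed problem leads to a short procedure: embed each vertex using the eigenvectors of the $k$ smallest eigenvalues, then cluster the embedded points. The sketch below is one possible realization under stated assumptions: the clustering step is a plain k-means with a deterministic farthest-point initialization (chosen here for reproducibility, not taken from the slides), and the graph is a hypothetical pair of triangles joined by a weak edge.

```python
import numpy as np

def spectral_embedding(W, k):
    """One row per vertex: eigenvectors of L = D - W for the k smallest eigenvalues."""
    L = np.diag(W.sum(axis=1)) - W
    _, vectors = np.linalg.eigh(L)      # eigh returns eigenvalues in ascending order
    return vectors[:, :k]

def kmeans(points, k, n_iter=50):
    """Lloyd iterations with a deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels

# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by a weak edge 2-3.
W = np.zeros((6, 6))
for a, b, w in [(0, 1, 1), (0, 2, 1), (1, 2, 1), (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 0.1)]:
    W[a, b] = W[b, a] = w

labels = kmeans(spectral_embedding(W, 2), 2)
```

On this example the two triangles are recovered as the two clusters, since cutting the weak edge gives the cheapest cut.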
37. A regularized version of $L$
Regularization: the diffusion matrix: for $\beta > 0$,
$$K^\beta = e^{-\beta L} = \sum_{k=0}^{+\infty} \frac{(-\beta L)^k}{k!}.$$
⇒
$$k^\beta : V \times V \to \mathbb{R}, \quad (x_i, x_j) \mapsto K^\beta_{i,j}$$
is the diffusion kernel (or heat kernel).
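Since $L$ is symmetric, the matrix exponential can be computed from its eigendecomposition rather than from the series. The sketch below compares both routes on a hypothetical 3-vertex path graph; the function names are illustrative, and the series includes its $k = 0$ term (the identity matrix).

```python
import numpy as np

def heat_kernel(L, beta):
    """K^beta = exp(-beta * L), from the eigendecomposition of the symmetric matrix L."""
    eigenvalues, vectors = np.linalg.eigh(L)
    return (vectors * np.exp(-beta * eigenvalues)) @ vectors.T

def heat_kernel_series(L, beta, n_terms=30):
    """Truncated power series sum_{k=0}^{n_terms-1} (-beta L)^k / k!."""
    term = np.eye(L.shape[0])          # k = 0 term
    total = term.copy()
    for k in range(1, n_terms):
        term = term @ (-beta * L) / k
        total += term
    return total

# Hypothetical path graph 1 - 2 - 3.
W = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W
K = heat_kernel(L, 0.5)
```

The rows of $K^\beta$ sum to 1 because the constant vector lies in the kernel of $L$, which matches the reading of $K^\beta$ as a redistribution of heat.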
39. Diffusion process on a graph
If $Z_0 = (1\ 1\ 1\ \ldots\ 1\ 1)^T$ is the "heat" of each vertex at time 0 and if a small fraction $\epsilon$ of this heat spreads along the edges of the graph at each time step, then after $t$ time steps the heat of the vertices is:
$$Z_t = (I - \epsilon L)^t Z_0$$
Limit: shrink the time step to $\Delta t$ by substituting $t \to t/\Delta t$ and $\epsilon \to \epsilon \Delta t$; letting $\Delta t \to 0$ (continuous diffusion process) gives:
$$\lim_{\Delta t \to 0} Z_t = e^{-\epsilon t L} Z_0 = K^{\epsilon t} Z_0$$
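The passage to the continuous limit can be observed numerically. The sketch below assumes the sign convention $K^\beta = e^{-\beta L}$ with $L = D - W$, so that one discrete step multiplies the heat vector by $I - \epsilon L$; the graph, the product $\epsilon t$ and the starting vector are hypothetical.

```python
import numpy as np

# Hypothetical path graph 1 - 2 - 3.
W = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W
eps_t = 0.8                        # product epsilon * t, kept fixed while refining
z0 = np.array([1.0, 0.0, 0.0])     # all the heat starts on vertex 1

def discrete_diffusion(n_steps):
    """Heat after n_steps steps, each spreading a fraction eps_t / n_steps."""
    step = np.eye(3) - (eps_t / n_steps) * L
    return np.linalg.matrix_power(step, n_steps) @ z0

# Continuous limit exp(-eps_t * L) z0, from the eigendecomposition of L.
eigenvalues, vectors = np.linalg.eigh(L)
limit = (vectors * np.exp(-eps_t * eigenvalues)) @ vectors.T @ z0

errors = [np.abs(discrete_diffusion(m) - limit).max() for m in (10, 100, 1000)]
```

The gap to the exponential shrinks as the number of steps grows, and the total heat stays equal to 1 throughout.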
42. Properties
1 Diffusion on the graph: $k^\beta(x_i, x_j)$ is the amount of heat accumulated at $x_j$ after a given time if a unit of heat is injected at $x_i$ at time 0 and diffusion runs continuously along the edges of the graph.
β: intensity of the diffusion;
2 Regularizing operator: for $u \in \mathbb{R}^n \sim V$ of fixed norm, $u^T K^\beta u$ is smaller for vectors $u$ that vary strongly between two "close" vertices of the graph, because $K^\beta = e^{-\beta L}$ damps the components of $u$ along the eigenvectors of $L$ with large eigenvalues.
β: intensity of the regularization (for small β, direct neighborhoods matter more);
3 Reproducing kernel property: $k^\beta$ is symmetric and positive ⇒ there exist a Hilbert space $(\mathcal{H}, \langle \cdot, \cdot \rangle)$ and a map $\phi : V \to \mathcal{H}$ such that
$$k^\beta(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle.$$
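For a finite graph, one explicit feature map can be written down from the eigendecomposition $L = V \Lambda V^T$: take $\phi(x_i)$ to be row $i$ of $V e^{-\beta \Lambda / 2}$. This is only one valid choice among many, shown here on a hypothetical 4-vertex graph to illustrate the third property.

```python
import numpy as np

# Hypothetical unweighted graph on 4 vertices.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W
beta = 0.3

eigenvalues, vectors = np.linalg.eigh(L)
K = (vectors * np.exp(-beta * eigenvalues)) @ vectors.T

# One valid feature map: phi(x_i) is row i of V * exp(-beta * lambda / 2).
Phi = vectors * np.exp(-beta * eigenvalues / 2)
gram = Phi @ Phi.T
```

The Gram matrix of these feature vectors reproduces $K^\beta$, and all eigenvalues of $K^\beta$ are strictly positive since they equal $e^{-\beta \lambda}$.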
43. Batch kernel SOM [Villa and Rossi, 2007]
Randomly initialize $\gamma^0_{ji} \in \mathbb{R}$ ($i = 1, \ldots, n$; $j = 1, \ldots, M$) and set $p^0_j = \sum_{i=1}^{n} \gamma^0_{ji} \phi(x_i)$. Then, for $l = 1, \ldots, n$, repeat:
Assignment step
for every $x_i$,
$$f^l(x_i) = \arg\min_{j=1,\ldots,M} \Big\| \phi(x_i) - \sum_{i'=1}^{n} \gamma^l_{ji'} \phi(x_{i'}) \Big\|_{\mathcal{H}}$$
Representation step
$$\gamma^l_j = \arg\min_{\gamma \in \mathbb{R}^n} \sum_{i=1}^{n} h(f^l(x_i), j) \, \Big\| \phi(x_i) - \sum_{l'=1}^{n} \gamma_{l'} \phi(x_{l'}) \Big\|_{\mathcal{H}}^2$$
44. Batch kernel SOM [Villa and Rossi, 2007]
Randomly initialize $\gamma^0_{ji} \in \mathbb{R}$ ($i = 1, \ldots, n$; $j = 1, \ldots, M$) and set $p^0_j = \sum_{i=1}^{n} \gamma^0_{ji} \phi(x_i)$. Then, for $l = 1, \ldots, n$, repeat:
Assignment step
for every $x_i$,
$$f^l(x_i) = \arg\min_{j=1,\ldots,M} \sum_{u,u'=1}^{n} \gamma^l_{ju} \gamma^l_{ju'} k^\beta(x_u, x_{u'}) - 2 \sum_{u=1}^{n} \gamma^l_{ju} k^\beta(x_u, x_i)$$
Representation step
$$\gamma^l_{ji} = \frac{h(f^l(x_i), j)}{\sum_{i'=1}^{n} h(f^l(x_{i'}), j)}$$
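Because both steps only involve kernel values, the algorithm can run on the kernel matrix alone, without ever forming $\phi$. The sketch below is an illustration under assumptions, not the authors' code: the function names, the Gaussian neighborhood on squared grid distances, the fixed iteration count and the seed are all hypothetical. The helper adds the term $k(x_i, x_i)$, which is constant in $j$ and therefore does not change the arg min, so that it returns true squared distances.

```python
import numpy as np

def sq_dist_to_prototypes(K, gamma):
    """Squared distances ||phi(x_i) - p_j||^2, shape (M, n), from kernel values only."""
    quad = np.einsum("ju,uv,jv->j", gamma, K, gamma)   # sum_{u,u'} gamma_ju gamma_ju' k(x_u, x_u')
    cross = gamma @ K                                   # sum_u gamma_ju k(x_u, x_i)
    return quad[:, None] - 2 * cross + np.diag(K)[None, :]

def batch_kernel_som(K, grid, n_iter=20, sigma=1.0, seed=0):
    """K: (n, n) kernel matrix; grid: (M, q) neuron coordinates. Returns (f, gamma)."""
    rng = np.random.default_rng(seed)
    n, M = K.shape[0], grid.shape[0]
    gamma = rng.random((M, n))
    gamma /= gamma.sum(axis=1, keepdims=True)
    d = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    H = np.exp(-d / (2 * sigma ** 2))                   # assumed neighborhood function
    for _ in range(n_iter):
        # Assignment step: nearest prototype in the feature space.
        f = sq_dist_to_prototypes(K, gamma).argmin(axis=0)
        # Representation step: gamma_ji = h(f(x_i), j) / sum_i' h(f(x_i'), j).
        gamma = H[:, f]
        gamma = gamma / gamma.sum(axis=1, keepdims=True)
    return f, gamma

# Sanity check with a linear kernel, where phi is the identity map (hypothetical data).
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
K_lin = X @ X.T
g = rng.random((2, 6))
g /= g.sum(axis=1, keepdims=True)
explicit = (((g @ X)[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
f, gamma = batch_kernel_som(K_lin, np.array([[0.0], [1.0]]))
```

To cluster the vertices of a graph, the same function would be called with the heat kernel matrix $K^\beta$ in place of `K_lin`.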
46. Maps obtained [Boulet et al., 2008]
48. Some thematic maps
1 Names
2 Dates and comparison
3 Places and comparison
49. Global representation. Continued...
Produced by Dinh Truong and Tao Dkaki
52. References
Boulet, R., Jouve, B., Rossi, F., and Villa, N. (2008). Batch kernel SOM and related Laplacian methods for social network analysis. Neurocomputing. To appear.
Kohonen, T. (2001). Self-Organizing Maps, 3rd edition, volume 30. Springer, Berlin, Heidelberg, New York.
Kondor, R. and Lafferty, J. (2002). Diffusion kernels on graphs and other discrete structures. In Proceedings of the 19th International Conference on Machine Learning, pages 315–322.
Villa, N. and Rossi, F. (2007). A comparison between dissimilarity SOM and kernel SOM for clustering the vertices of a graph. In Proceedings of the 6th Workshop on Self-Organizing Maps (WSOM 07), Bielefeld, Germany.