A tutorial, in the form of exercises, for discovering the SPARQL endpoint provided by the HAL platform, the open archive of scientific articles across all disciplines from French research institutions. Note: this tutorial assumes prior knowledge of the SPARQL query language.
1. Discovering the HAL SPARQL Endpoint
Gautier Poupeau
gautier.poupeau@gmail.com
@lespetitescases
http://www.lespetitescases.net
2. The Semantic Web in HAL
• A SPARQL endpoint
https://data.archives-ouvertes.fr/sparql
whose documentation is available at
https://data.archives-ouvertes.fr/doc/schema
• Dumps generated once a month
https://data.archives-ouvertes.fr/backup
• No content negotiation
https://data.archives-ouvertes.fr/
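Before running the exercises, a minimal sanity check you can paste into the endpoint's query form (a sketch; the LIMIT value is arbitrary):

select ?s ?p ?o
where
{
?s ?p ?o
}
LIMIT 10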
3. The document model (1)
Example: https://data.archives-ouvertes.fr/document/hal-00000001v2
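To explore this example yourself, a sketch that lists every triple attached to the document (the same pattern as Exercise 8 later in the deck):

select ?p ?o
where
{
<https://data.archives-ouvertes.fr/document/hal-00000001v2> ?p ?o
}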
4. The document model (2)
The diagram on this slide shows the graph around the example document:
• The document (document/hal-00000001) links to its versioned document (document/hal-00000001v2) via dcterms:hasVersion.
• The versioned document links to its PDF file (hal-00000001v2/file/mq-anglais.pdf) via ore:aggregates.
• The versioned document links to its document type (a FaBiO URI) via rdf:type.
• The versioned document links to an author (a blank node) via dcterms:creator.
• The blank node links to the author form (author/63529) via hal:person.
• The author form carries the author's name (a string) via foaf:name.
5. Tip 1
Finding the URI that corresponds to a prefix
You type in the prefix
The system returns one or more matching URIs
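The exercises below use prefixed names such as doc:, dcterms:, ore:, hal:, foaf:, skos:, owl:, org: and cerif:, which the queries in this deck assume are predefined by the endpoint. If your client does not predefine them, a sketch of the declarations to add; doc: is inferable from the examples in this deck, hal: is an assumption to verify with the tip above or against https://data.archives-ouvertes.fr/doc/schema, and the rest are the standard vocabularies:

PREFIX doc:     <https://data.archives-ouvertes.fr/document/>   # inferred from the examples
PREFIX hal:     <http://data.archives-ouvertes.fr/schema/>      # assumption, verify with Tip 1
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX ore:     <http://www.openarchives.org/ore/terms/>
PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>
PREFIX owl:     <http://www.w3.org/2002/07/owl#>
PREFIX org:     <http://www.w3.org/ns/org#>
PREFIX cerif:   <http://www.eurocris.org/ontologies/cerif/1.3/>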
7. Retrieving the metadata attached to every version of a document
Useful URI
Document: https://data.archives-ouvertes.fr/document/inria-00362381
Exercise 1
select ?version ?p ?object
where
{
doc:inria-00362381 dcterms:hasVersion ?version.
?version ?p ?object.
}
8. Retrieving the URI, title, PDF link, and authors of the different versions of the document
Useful URI
Document: https://data.archives-ouvertes.fr/document/inria-00362381
Exercise 2
select ?version ?title ?pdf ?person ?name
where
{
doc:inria-00362381 dcterms:hasVersion ?version.
?version dcterms:title ?title; dcterms:creator ?creator; ore:aggregates ?pdf.
?creator hal:person ?person.
?person foaf:name ?name
}
9. Tip 3
Concatenating the values of a variable
select ?version ?title ?pdf GROUP_CONCAT(?name,',') AS ?authors
where
{
doc:inria-00362381 dcterms:hasVersion ?version.
?version dcterms:title ?title; dcterms:creator ?creator; ore:aggregates ?pdf.
?creator hal:person ?person.
?person foaf:name ?name
}
GROUP BY ?version ?title ?pdf
The results are grouped by every other element selected in the query
The GROUP_CONCAT(?variable, separator) keyword does the concatenation
10. Deposit type and domain
Example: https://data.archives-ouvertes.fr/doctype/Article
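As with documents, you can inspect this part of the model directly; a sketch listing the triples on the example deposit type:

select ?p ?o
where
{
<https://data.archives-ouvertes.fr/doctype/Article> ?p ?o
}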
11. Retrieving the URI and preferred label of the document types of the different versions of a document
Useful URI
Document: https://data.archives-ouvertes.fr/document/inria-00362381
Exercise 3
select ?version ?type ?label
where
{
doc:inria-00362381 dcterms:hasVersion ?version.
?version dcterms:type ?type.
?type skos:prefLabel ?label.
}
12. Tip 4
Displaying only the French-language string
select ?version ?type ?label
where
{
doc:inria-00362381 dcterms:hasVersion ?version.
?version dcterms:type ?type.
?type skos:prefLabel ?label.
FILTER (lang(?label)='fr')
}
Uses the FILTER keyword
13. Displaying the preferred label of every distinct document type with the "Fabien Gandon" form as author
Useful URI
Author form: https://data.archives-ouvertes.fr/author/827904
Exercise 4
select DISTINCT ?label
where
{
?document dcterms:hasVersion ?version.
?version dcterms:creator ?creator; dcterms:type ?type.
?creator hal:person <https://data.archives-ouvertes.fr/author/827904>.
?type skos:prefLabel ?label.
FILTER (lang(?label)='fr')
}
14. Tip 5
Using the aggregation operators
count, sum, avg, min, max
select count(distinct ?person) AS ?nbPerson
where
{
doc:inria-00362381 dcterms:hasVersion ?version.
?version dcterms:creator ?creator.
?creator hal:person ?person
}
Retrieves the number of authors of a document, all versions combined
Useful URI
Document: https://data.archives-ouvertes.fr/document/inria-00362381
15. Tip 5 (continued)
Using the aggregation operators
count, sum, avg, min, max
select ?version, count(distinct ?person) AS ?nbperson
where
{
doc:inria-00362381 dcterms:hasVersion ?version.
?version dcterms:creator ?creator.
?creator hal:person ?person
}
GROUP BY ?version
Retrieves the number of authors per version of a document
Useful URI
Document: https://data.archives-ouvertes.fr/document/inria-00362381
16. Displaying the number of documents per distinct document type with the "Fabien Gandon" form as author
Useful URI
Author form: https://data.archives-ouvertes.fr/author/827904
Exercise 5
select ?type ?label count(distinct ?document)
where
{
?document dcterms:hasVersion ?version.
?version dcterms:creator ?creator; dcterms:type ?type.
?creator hal:person <https://data.archives-ouvertes.fr/author/827904>.
?type skos:prefLabel ?label.
FILTER (lang(?label)='fr')
}
GROUP BY ?type ?label
17. Tip 6
Ordering a result
select ?type ?label count(distinct ?document) AS ?nbdoc
where
{
?document dcterms:hasVersion ?version.
?version dcterms:creator ?creator; dcterms:type ?type.
?creator hal:person <https://data.archives-ouvertes.fr/author/827904>.
?type skos:prefLabel ?label.
FILTER (lang(?label)='fr')
}
GROUP BY ?type ?label
ORDER BY DESC(?nbdoc)
Displays, in descending order, the number of documents per distinct document type with the "Fabien Gandon" form as author
Useful URI
Author form: https://data.archives-ouvertes.fr/author/827904
18. Tip 7
Working with dates
Displays the number of documents per year for the "Fabien Gandon" form
Useful URI
Author form: https://data.archives-ouvertes.fr/author/827904
select year(?date) count(distinct ?document)
where
{
?document dcterms:hasVersion ?version.
?version dcterms:creator ?creator; dcterms:issued ?date.
?creator hal:person <https://data.archives-ouvertes.fr/author/827904>.
}
GROUP BY year(?date)
ORDER BY year(?date)
19. Displaying the number of documents per document type and per year with the "Fabien Gandon" form as author, ordered by year
Useful URI
Author form: https://data.archives-ouvertes.fr/author/827904
Exercise 7
select year(?date) ?label count(distinct ?document)
where
{
?document dcterms:hasVersion ?version.
?version dcterms:creator ?creator; dcterms:issued ?date; dcterms:type ?type.
?creator hal:person <https://data.archives-ouvertes.fr/author/827904>.
?type skos:prefLabel ?label.
FILTER (lang(?label)='fr')
}
GROUP BY year(?date) ?label
ORDER BY year(?date)
20. The model for an author form
Example: https://data.archives-ouvertes.fr/author/827904
21. Displaying all the metadata attached to the "Fabien Gandon" form
Useful URI
Author form: https://data.archives-ouvertes.fr/author/827904
Exercise 8
select ?p ?o
where
{
<https://data.archives-ouvertes.fr/author/827904> ?p ?o
}
22. Displaying the URIs of all the forms linked to the "Fabien Gandon" form through the IdHAL
Useful URI
Author form: https://data.archives-ouvertes.fr/author/827904
Exercise 9
select ?forme
where
{
<https://data.archives-ouvertes.fr/author/827904> ore:isAggregatedBy ?o.
?forme ore:isAggregatedBy ?o.
}
23. Displaying the number of documents attached to each form aggregated by the IdHAL linked to the "Fabien Gandon" form
Useful URI
Author form: https://data.archives-ouvertes.fr/author/827904
Exercise 10
select ?forme count(DISTINCT ?document)
where
{
<https://data.archives-ouvertes.fr/author/827904> ore:isAggregatedBy ?o.
?forme ore:isAggregatedBy ?o.
?creator hal:person ?forme.
?version dcterms:creator ?creator.
?document dcterms:hasVersion ?version
}
GROUP BY ?forme
24. Displaying, per year, the number of documents attached to the forms aggregated by the IdHAL linked to the "Fabien Gandon" form
Useful URI
Author form: https://data.archives-ouvertes.fr/author/827904
Exercise 11
select year(?date) AS ?year count(DISTINCT ?document) AS ?nbdoc
where
{
<https://data.archives-ouvertes.fr/author/827904> ore:isAggregatedBy ?o.
?forme ore:isAggregatedBy ?o.
?creator hal:person ?forme.
?version dcterms:creator ?creator; dcterms:issued ?date.
?document dcterms:hasVersion ?version
}
GROUP BY year(?date)
25. Tip 8
Queries and subqueries
Displays the average number of documents per year across all the forms linked to "Fabien Gandon"
Useful URI
Author form: https://data.archives-ouvertes.fr/author/827904
SELECT avg(?nbdoc)
WHERE {
select year(?date) AS ?year count(DISTINCT ?document) AS ?nbdoc
where
{
<https://data.archives-ouvertes.fr/author/827904> ore:isAggregatedBy ?o.
?forme ore:isAggregatedBy ?o.
?creator hal:person ?forme.
?version dcterms:creator ?creator; dcterms:issued ?date.
?document dcterms:hasVersion ?version
}
GROUP BY year(?date)
}
26. Displaying all the external identifiers linked to Fabien Gandon
Useful URI
Author form: https://data.archives-ouvertes.fr/author/827904
Exercise 12
select DISTINCT ?idhal ?identifiant
where
{
<https://data.archives-ouvertes.fr/author/827904> ore:isAggregatedBy ?idhal.
?forme ore:isAggregatedBy ?idhal.
?forme owl:sameAs ?identifiant
}
27. Tip 9
Searching within a character string
Displays the distinct ORCID identifiers linked to all the forms linked to "Fabien Gandon"
Useful URI
Author form: https://data.archives-ouvertes.fr/author/827904
select DISTINCT ?idhal ?orcid
where
{
<https://data.archives-ouvertes.fr/author/827904> ore:isAggregatedBy ?idhal.
?forme ore:isAggregatedBy ?idhal.
?forme owl:sameAs ?orcid.
FILTER regex(str(?orcid), 'orcid.org')
}
28. Displaying the ORCID identifiers of all INRIA members and, for each ORCID, concatenating the URIs of the associated forms
Useful URI
INRIA: https://data.archives-ouvertes.fr/structure/300009
Exercise 13
select ?orcid ?name GROUP_CONCAT(str(?forme),',')
where
{
?forme foaf:member <https://data.archives-ouvertes.fr/structure/300009>;
foaf:name ?name;
owl:sameAs ?orcid.
FILTER regex(str(?orcid),'orcid.org')
}
GROUP BY ?orcid ?name
ORDER BY ?name
29. Displaying all INRIA members and, for each member, their ORCID identifier and IdHAL when available
Useful URI
INRIA: https://data.archives-ouvertes.fr/structure/300009
Tip 10
OPTIONAL
select DISTINCT ?name ?orcid ?idHal
where
{
?forme foaf:member <https://data.archives-ouvertes.fr/structure/300009>;
foaf:name ?name.
OPTIONAL {?forme owl:sameAs ?orcid. FILTER regex(?orcid,'orcid.org')}
OPTIONAL {?forme ore:isAggregatedBy ?idHal}
}
ORDER BY ?name
30. Tip 11
Negation
Displays the members who have an IdHAL but no ORCID
Useful URI
INRIA: https://data.archives-ouvertes.fr/structure/300009
select DISTINCT ?idHal ?name
where
{
?forme foaf:member <https://data.archives-ouvertes.fr/structure/300009>;
foaf:name ?name; ore:isAggregatedBy ?idHal
FILTER NOT EXISTS {?forme owl:sameAs ?orcid. FILTER regex(?orcid,'orcid.org')}
}
ORDER BY ?name
31. Displaying all INRIA members who have no IdHAL, with their forms concatenated per name
Useful URI
INRIA: https://data.archives-ouvertes.fr/structure/300009
Exercise 14
select ?name, GROUP_CONCAT(str(?forme), '|')
where
{
?forme foaf:member <https://data.archives-ouvertes.fr/structure/300009>;
foaf:name ?name.
FILTER NOT EXISTS {?forme ore:isAggregatedBy ?idHal}
}
GROUP BY ?name
ORDER BY ?name
32. Displaying all the domains of INRIA members and, for each domain, the number of associated documents in descending order
Useful URI
INRIA: https://data.archives-ouvertes.fr/structure/300009
Exercise 15
select ?topic ?prefLabel count(DISTINCT ?document) AS ?nbdoc
where
{
?forme foaf:member <https://data.archives-ouvertes.fr/structure/300009>.
?version hal:topic ?topic; dcterms:creator ?creator.
?creator hal:person ?forme.
?topic skos:prefLabel ?prefLabel.
?document dcterms:hasVersion ?version.
FILTER (lang(?prefLabel)='fr')
}
GROUP BY ?topic ?prefLabel
ORDER BY DESC(?nbdoc)
33. Displaying all of Fabien Gandon's co-authors and the number of articles for each of them
Useful URI
Fabien Gandon's IdHAL: https://data.archives-ouvertes.fr/author/fabien-gandon
Exercise 16
select ?name count(DISTINCT ?document) AS ?nbdoc
where
{
?forme ore:isAggregatedBy <https://data.archives-ouvertes.fr/author/fabien-gandon>.
?version hal:topic ?topic; dcterms:creator ?creator.
?creator hal:person ?forme.
?document dcterms:hasVersion ?version.
?version dcterms:creator ?autresCreator.
?autresCreator hal:person ?autresformes.
?autresformes foaf:name ?name.
FILTER (?name !='Fabien Gandon')
}
GROUP BY ?name
ORDER BY DESC(?nbdoc)
34. Tip 12
Limiting results based on counts
Displays all of Fabien Gandon's co-authors and the number of articles for each of them, keeping only those with more than one document in common
Useful URI
Fabien Gandon's IdHAL: https://data.archives-ouvertes.fr/author/fabien-gandon
select ?name count(DISTINCT ?document) AS ?nbdoc
where
{
?forme ore:isAggregatedBy <https://data.archives-ouvertes.fr/author/fabien-gandon>.
?version hal:topic ?topic; dcterms:creator ?creator.
?creator hal:person ?forme.
?document dcterms:hasVersion ?version.
?version dcterms:creator ?autresCreator.
?autresCreator hal:person ?autresformes.
?autresformes foaf:name ?name.
FILTER (?name !='Fabien Gandon')
}
GROUP BY ?name
HAVING (count(DISTINCT ?document) > 1)
ORDER BY DESC(?nbdoc)
35. The model for IdHAL persons
It looks nice but, unfortunately, it is not in the triple store
36. The model for structures
Example: https://data.archives-ouvertes.fr/structure/300009
37. Displaying all organizations and their preferred label
Useful URIs
Organization class: http://www.w3.org/ns/org#Organization
INRIA: https://data.archives-ouvertes.fr/structure/300009
Exercise 17
select ?structure ?prefLabel
where
{
?structure a org:Organization; skos:prefLabel ?prefLabel.
}
38. The model for journals
Example: https://data.archives-ouvertes.fr/revue/109707
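No exercise targets journals in this deck, but the same exploratory pattern applies; a sketch on the example journal:

select ?p ?o
where
{
<https://data.archives-ouvertes.fr/revue/109707> ?p ?o
}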
39. The models for projects
Examples: https://data.archives-ouvertes.fr/anrProject/1001
https://data.archives-ouvertes.fr/europeanProject/129494
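Before tackling Exercise 18, you can explore a single project record the same way; a sketch on the example ANR project:

select ?p ?o
where
{
<https://data.archives-ouvertes.fr/anrProject/1001> ?p ?o
}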
40. Displaying all the ANR projects behind the documents written by INRIA members
Useful URIs
Project class: http://www.eurocris.org/ontologies/cerif/1.3/
INRIA: https://data.archives-ouvertes.fr/structure/300009
Exercise 18
select DISTINCT ?projet ?title ?acronym ?startDate
where
{
?forme foaf:member <https://data.archives-ouvertes.fr/structure/300009>.
?version dcterms:source ?projet; dcterms:creator ?creator.
?creator hal:person ?forme.
?projet a cerif:Project; cerif:title ?title; cerif:startDate ?startDate.
OPTIONAL {?projet cerif:acronym ?acronym}
FILTER regex(str(?projet),'anrProject')
}
ORDER BY DESC(?startDate)
41. Displaying all the ANR projects behind the documents written by INRIA members and, for each, the number of associated documents
Useful URIs
Project class: http://www.eurocris.org/ontologies/cerif/1.3/
INRIA: https://data.archives-ouvertes.fr/structure/300009
Exercise 19
select ?projet ?title ?acronym ?startDate count(DISTINCT ?document) AS ?nbdoc
where
{
?forme foaf:member <https://data.archives-ouvertes.fr/structure/300009>.
?version dcterms:source ?projet; dcterms:creator ?creator.
?creator hal:person ?forme.
?projet a cerif:Project; cerif:title ?title; cerif:startDate ?startDate.
OPTIONAL {?projet cerif:acronym ?acronym}
FILTER regex(str(?projet),'anrProject')
?document dcterms:hasVersion ?version.
}
GROUP BY ?projet ?title ?acronym ?startDate
ORDER BY DESC(?nbdoc)