The document provides an overview of RDF stream processing, including:
- Extending the RDF data model to represent RDF streams and associate application times with data items
- Modeling continuous query evaluation over RDF streams using the CQL/STREAM approach of mapping streams to relations through sliding windows (a sliding-window sketch follows this list)
- How existing systems extend SPARQL with CQL-style operators for mapping between RDF streams and relations and for evaluating continuous SPARQL queries over windows of streaming RDF data
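To make the stream-to-relation step concrete, here is a minimal Python sketch of a CQL-style time-based sliding window (the data and window width are invented for illustration; this is not any engine's API): the window repeatedly turns the unbounded stream into a finite relation that an ordinary query can then evaluate.

```python
from collections import deque

def sliding_window(stream, width):
    """Stream-to-relation: for each arrival, yield the finite bag of
    items whose application time lies within the last `width` units."""
    buf = deque()
    for item, ts in stream:
        buf.append((item, ts))
        while buf and buf[0][1] <= ts - width:  # evict expired items
            buf.popleft()
        yield [i for i, _ in buf]  # current window contents

# Items arriving at application times 1, 3, 6, 9:
events = [("e1", 1), ("e2", 3), ("e3", 6), ("e4", 9)]
for relation in sliding_window(events, width=5):
    print(relation)
```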
IPython Notebook as a Unified Data Science Interface for Hadoop (DataWorks Summit)
This document discusses using IPython Notebook as a unified data science interface for Hadoop. It proposes that a unified environment needs: 1) mixed local and distributed processing via Apache Spark, 2) access to languages like Python via PySpark, 3) seamless SQL integration via SparkSQL, and 4) visualization and reporting via IPython Notebook. The document demonstrates this environment by exploring open payments data between doctors/hospitals and manufacturers.
Python can be used for big data applications and processing on Hadoop. Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It allows for the distributed processing of large datasets across clusters of computers using simple programming models. MapReduce is a programming model used in Hadoop for processing and generating large datasets in a distributed computing environment.
This document discusses Apache Pig and its role in data science. It begins with an introduction to Pig, describing it as a high-level scripting language for operating on large datasets in Hadoop. It transforms data operations into MapReduce/Tez jobs and optimizes the number of jobs required. The document then covers using Pig for understanding data through statistics and sampling, machine learning by sampling large datasets and applying models with UDFs, and natural language processing on large unstructured data.
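As a sketch of the "applying models with UDFs" point, here is what a Pig UDF written in Python (executed under Jython) can look like. The schema and function are invented for illustration; pig_util is the helper module Pig ships for declaring output schemas.

```python
from pig_util import outputSchema

@outputSchema("word_count:int")
def count_words(text):
    # Null-safe token count, callable from a Pig script as a scalar UDF.
    if text is None:
        return 0
    return len(text.split())
```

A Pig script would register it with something like REGISTER 'udfs.py' USING jython AS myfuncs; and then call myfuncs.count_words(line).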
Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013 (Sonal Raj)
This talk briefly outlines the Storm framework and Neo4J graph database, and how to compositely use them to perform computations on complex graphs in Python using the Petrel and Py2neo packages. This talk was given at PyCon India 2013.
This document outlines a workshop on Apache Spark given by Dr. Thanachart Numnonda. The workshop covers launching an Azure instance, installing Docker and Cloudera Quickstart, importing and exporting data to HDFS, connecting to the master node via SSH, and an introduction to Spark including RDDs and transformations. Hands-on exercises are provided to demonstrate importing data, connecting to nodes, and using HDFS and Spark APIs.
Towards Answering Provenance-Enabled SPARQL Queries over RDF Data Cubes (Kim Ahlstrøm)
The SPARQL 1.1 standard has made it possible to formulate analytical queries in SPARQL. While some approaches have become available for processing analytical queries on RDF data cubes, little attention has been paid to answering provenance-enabled queries over such data. Yet, considering provenance is a prerequisite for validating whether a query result is trustworthy. The main challenge is how provenance is encoded in standard triple stores, namely as context values (named graphs). Hence, in this paper, we analyze the suitability of existing triple stores for answering provenance-enabled queries on RDF data cubes, identify their shortcomings, and propose an index to handle the high number of context values that provenance encoding typically entails. Our experimental results using the Star Schema Benchmark show the feasibility and scalability of our index and query evaluation strategies.
Hadoop Summit 2016 presentation.
As Yahoo continues to grow and diversify its mobile products, its platform team faces an ever-expanding, sophisticated group of analysts and product managers, who demand faster answers: to measure key engagement metrics like Daily Active Users, app installs and user retention; to run longitudinal analysis; and to re-calculate metrics over rolling time windows. All of this, quickly enough not to need a coffee break while waiting for results. The optimal solution for this use case would have to take into account raw performance, cost, security implications and ease of data management. Could the benefits of Hive, ORC and Tez, coupled with a good data design, provide the performance our customers crave? Or would it make sense to use more nascent, off-grid querying systems? This talk will examine the efficacy of using Hive for large-scale mobile analytics. We will quantify Hive performance on a traditional, shared, multi-tenant Hadoop cluster, and compare it with more specialized analytics tools on a single-tenant cluster. We will also highlight which tuning parameters yield maximum benefits, and analyze the surprisingly ineffectual ones. Finally, we will detail several enhancements made by Yahoo's Hive team (in split calculation, stripe elimination and the metadata system) to successfully boost performance.
The document discusses using Python with Hadoop frameworks. It introduces Hadoop Distributed File System (HDFS) and MapReduce, and how to use the mrjob library to write MapReduce jobs in Python. It also covers using Python with higher-level Hadoop frameworks like Pig, accessing HDFS with snakebite, and using Python clients for HBase and the PySpark API for the Spark framework. Key advantages discussed are Python's rich ecosystem and ability to access Hadoop frameworks.
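For flavour, here is a minimal mrjob job of the kind the summary refers to (a standard word count, not taken from the deck itself); python wordcount.py input.txt runs it locally, and adding -r hadoop submits it to a cluster.

```python
from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every token in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the per-word counts produced by all mappers.
        yield word, sum(counts)

if __name__ == "__main__":
    WordCount.run()
```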
Slides for the tutorial about Apache Giraph for the Data Mining class.
Sapienza, University of Rome.
Master of Science in Engineering in Computer Science
Prof. A. Anagnostopoulos, I. Chatzigiannakis, A. Gionis
Data Mining class
Fall 2016
Big Data Analytics Using Hadoop Cluster On Amazon EMR (IMC Institute)
This document outlines steps for running a hands-on workshop on using Hadoop and Amazon EMR. It includes instructions for creating an AWS account and EMR cluster, importing sample data, writing and running a MapReduce word count program in Eclipse, and using Hive to create tables and query data on the EMR cluster. The workshop covers fundamental Hadoop and Hive concepts and commands to analyze big data on Amazon EMR.
Analyse Tweets using Flume 1.4, Hadoop 2.7 and Hive (IMC Institute)
This document outlines steps to analyze tweets using Flume, Hadoop, and Hive. It describes installing Flume and a jar file, creating a Twitter application to get API keys, configuring a Flume agent to fetch tweets from Twitter and store them in HDFS, and using Hive to analyze the tweets by user follower count. The workshop instructions provide commands to stream tweets from Twitter to HDFS using Flume, view the stored tweet files, and run queries in Hive to find the user with the most followers.
Fast, Scalable Graph Processing: Apache Giraph on YARN (DataWorks Summit)
Apache Giraph performs offline, batch processing of very large graph datasets on top of a Hadoop cluster. Giraph replaces iterative MapReduce-style solutions with Bulk Synchronous Parallel graph processing using in-memory or disk-based data sets, loosely following the model of Google's Pregel. Many recent advances have left Giraph more robust, efficient, fast, and able to accept a variety of I/O formats typical for graph data in and out of the Hadoop ecosystem. Giraph's recent port to a pure YARN platform offers increased performance, fine-grained resource control, and scalability that Giraph atop Hadoop MRv1 cannot, while paving the way for ports to other platforms like Apache Mesos. Come see what's on the roadmap for Giraph, what Giraph on YARN means, and how Giraph is leveraging the power of YARN to become a more robust, usable, and useful platform for processing Big Graph datasets.
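To ground the Bulk Synchronous Parallel idea, here is a toy, single-process Python sketch of Pregel-style maximum-value propagation (an illustration of the model, not Giraph's Java API; the graph and values are made up): in each superstep every vertex consumes its incoming messages, updates its value, and messages its neighbours, and the computation halts when no messages remain.

```python
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
value = {"a": 3, "b": 6, "c": 2}

# Superstep 0: every vertex announces its value to its neighbours.
messages = {v: [] for v in graph}
for v, nbrs in graph.items():
    for n in nbrs:
        messages[n].append(value[v])

# Subsequent supersteps: update on improvement, then propagate.
while any(messages.values()):
    nxt = {v: [] for v in graph}
    for v, inbox in messages.items():
        best = max(inbox, default=value[v])
        if best > value[v]:
            value[v] = best
            for n in graph[v]:
                nxt[n].append(best)
    messages = nxt

print(value)  # every vertex converges to the maximum: 6
```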
Hadoop Workshop using Cloudera on Amazon EC2 (IMC Institute)
This document provides instructions for a hands-on workshop on installing and using Hadoop and Cloudera on Amazon EC2. It outlines the steps to launch an EC2 virtual server instance, install Cloudera Manager and Cloudera Express Edition, import and export data from HDFS, write MapReduce programs in Eclipse, and use various Hadoop tools like HDFS and Hue. The workshop is led by Dr. Thanachart Numnonda and aims to teach participants how to set up their own Hadoop cluster on EC2 and start using Hadoop for big data tasks.
Architecting applications with Hadoop - Fraud Detection (hadooparchbook)
This document discusses architectures for fraud detection applications using Hadoop. It provides an overview of requirements for such an application, including the need for real-time alerts and batch processing. It proposes using Kafka for ingestion due to its high throughput and partitioning. HBase and HDFS would be used for storage, with HBase better supporting random access for profiles. The document outlines using Flume, Spark Streaming, and HBase for near real-time processing and alerting on incoming events. Batch processing would use HDFS, Impala, and Spark. Caching profiles in memory is also suggested to improve performance.
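A hedged sketch of the Kafka ingestion step with the kafka-python client (broker address, topic name, and event fields are invented): keying each event by profile id pins all of a profile's events to one partition, which preserves per-profile ordering for the streaming layer downstream.

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Key by profile id so all events for one profile hit the same partition.
producer.send("transactions", key=b"profile-42",
              value={"amount": 99.50, "merchant": "acme"})
producer.flush()
```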
These are slides from a lecture given at the UC Berkeley School of Information for the Analyzing Big Data with Twitter class. A video of the talk can be found at http://blogs.ischool.berkeley.edu/i290-abdt-s12/2012/08/31/video-lecture-posted-intro-to-hadoop/
The document discusses new features in Pig including macros, debugging tools, UDF support for scripting languages, embedding Pig in other languages, and new operators like nested CROSS and FOREACH. Examples are provided for macros, debugging with Pig Illustrate, writing UDFs in Python and Ruby, running Pig from Python, and using nested operators. Future additions mentioned are RANK and CUBE operators.
The document provides an overview of Hadoop, an open-source software framework for distributed storage and processing of large datasets. It describes how Hadoop uses HDFS for distributed file storage across clusters and MapReduce for parallel processing of data. Key components of Hadoop include HDFS for storage, YARN for resource management, and MapReduce for distributed computing. The document also discusses some popular Hadoop distributions and real-world uses of Hadoop by companies.
Jerven Bolleman, Lead Software Developer at Swiss-Prot Group, explained why they offer a free SPARQL and RDF endpoint for the world to use and why it is hard to optimize.
The document introduces JStorm, an open source distributed real-time computation framework. It was created by Alibaba to address issues with Apache Storm and improve performance for real-time applications. JStorm has been used by Alibaba to process over 3 trillion messages per day across 3000+ servers. Key features discussed include high throughput, fault tolerance, horizontal scalability, and more powerful scheduling capabilities compared to Storm.
How does that PySpark thing work? And why Arrow makes it faster? (Rubén Berenguel)
Back in ye olde days of Spark, using Python with Spark was an exercise in patience. Data was moving up and down from Python to Scala, being serialised constantly. Leveraging SparkSQL and avoiding UDFs made things better, as did the constant improvement of the optimisers (Catalyst and Tungsten). But with Spark 2.3, PySpark has sped up tremendously thanks to the (still experimental) addition of the Arrow serialisers.
In this talk we will learn how PySpark has improved its performance in Apache Spark 2.3 by using Apache Arrow. To do this, we will travel through the internals of Spark to find how Python interacts with the Scala core, and some of the internals of Pandas to see how data moves from Python to Scala via Arrow.
https://github.com/rberenguel/pyspark-arrow-pandas
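A short sketch of the behaviour described above (column expressions and sizes invented; the configuration key is the Spark 2.3-era one): with the Arrow flag enabled, toPandas() ships columnar record batches instead of pickling rows one by one.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-demo").getOrCreate()

# Opt in to the (then experimental) Arrow transfer path in Spark 2.3:
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.range(1000000).selectExpr("id", "id * 2 AS doubled")
pdf = df.toPandas()  # JVM -> Python via Arrow record batches
print(pdf.head())
```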
Revolution R Enterprise - Portland R User Group, November 2013 (Revolution Analytics)
This document provides a summary of a presentation given at the Revolution R Enterprise User Group meeting in Portland on November 13, 2013. The presentation introduced Revolution R Enterprise as a platform for performing high performance big data analytics using the R programming language. Key capabilities of Revolution R Enterprise include parallel and distributed computing, integration with various data sources, and deploying models and results across different platforms. A demo was given showing how Revolution R Enterprise can be used to analyze large datasets using algorithms pre-parallelized for high performance.
Rapid final presentation: The Illuminators (Ed Tackett)
1. The document describes a solar-powered street sign illuminator to help drivers more easily see residential street signs at night that are too high to be illuminated by car headlights.
2. The proposed solution is a device that can be retrofitted to existing street signs, using solar power to stay lit for around 10 hours and making signs more visible.
3. The design went through several iterations to improve lighting of dual signs, battery capacity, and durability for an estimated manufacturing cost of $101 per unit to sell at $200.
RSEP-QL: A Query Model to Capture Event Pattern Matching in RDF Stream Processing Languages (Daniele Dell'Aglio)
This document proposes a query model called RSEP-QL to capture event pattern matching in RDF stream processing languages. It presents RSEP-QL's data model of RDF streams and windows, basic operators like EVENT and SEQ, and evaluation semantics. The goal is to provide a reference model for comparing different RSP query languages and studying related problems in a standardized way.
The document discusses requirements and approaches for RDF stream processing (RSP). It covers the following key points:
RSP aims to process continuous RDF streams to address scenarios like sensor data and social media. It involves querying streaming data, integrating streams with static data, and handling issues like imperfections. The document reviews existing RSP systems and languages, actor-based approaches, and the 8 requirements for real-time stream processing including keeping data moving, generating predictable outcomes, and responding instantaneously.
This document discusses RDF stream processing and the role of semantics. It begins by outlining common sources of streaming data on the internet of things. It then discusses challenges of querying streaming data and existing approaches like CQL. Existing RDF stream processing systems are classified based on their query capabilities and use of time windows and reasoning. The role of linked data principles and HTTP URIs for representing streaming sensor data is discussed. Finally, requirements for reactive stream processing systems are outlined, including keeping data moving, integrating stored and streaming data, and responding instantaneously. The document argues that building relevant RDF stream processing systems requires going beyond existing requirements to address data heterogeneity, stream reasoning, and optimization.
These slides introduce CEP (Complex Event Processing) to developers.
From a software engineering standpoint, they explain the concept and capabilities of CEP against three criteria (generality, reusability, and immediate responsiveness), and describe how a Data Stream Query based CEP engine implementing them works.
CEP is not a new concept; it is what programmers have been doing all along. I hope this material helps developers who want to study CEP.
This document provides an overview of RDF stream processing and existing RDF stream processing engines. It discusses RDF streams and how sensor data can be represented as RDF streams. It also summarizes some existing RDF stream processing query languages and systems, including C-SPARQL, and the features they support like continuous execution, operators, and time-based windows. The document is intended as a tutorial for developers on working with RDF stream processing.
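For flavour, here is what a C-SPARQL-style continuous query can look like, shown as a plain string (the prefix, stream IRI, and window parameters are invented; a real deployment would register it with a C-SPARQL engine rather than print it): every minute it counts, over the last five minutes, whom each person has been with.

```python
# Illustrative C-SPARQL-style syntax only; no engine is invoked here.
CSPARQL_QUERY = """
REGISTER QUERY WhoIsWithWhom AS
PREFIX : <http://example.org/>
SELECT ?person (COUNT(?other) AS ?encounters)
FROM STREAM <http://example.org/social> [RANGE 5m STEP 1m]
WHERE { ?person :isWith ?other }
GROUP BY ?person
"""
print(CSPARQL_QUERY)
```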
Presentation on RDF Stream Processing models given at the SR4LD tutorial (ISWC 2013) -- updated version at: http://www.slideshare.net/dellaglio/rsp2014-01rspmodelsss
Why do they call it Linked Data when they want to say...? (Oscar Corcho)
The four Linked Data publishing principles established in 2006 seem to be quite clear and well understood by people inside and outside the core Linked Data and Semantic Web community. However, not only when discussing with outsiders about the goodness of Linked Data but also when reviewing papers for the COLD workshop series, I find myself, on many occasions, going back again to the principles in order to see whether some approach for Web data publication and consumption is actually Linked Data or not. In this talk we will review some of the current approaches that we have for publishing data on the Web, and we will reflect on why it is sometimes so difficult to reach an agreement on what we understand by Linked Data. Furthermore, we will take the opportunity to describe yet another approach that we have been working on recently at the Center for Open Middleware, a joint technology center between Banco Santander and Universidad Politécnica de Madrid, in order to facilitate Linked Data consumption.
The document discusses data discovery, conversion, integration and visualization using RDF. It covers topics like ontologies, vocabularies, data catalogs, converting different data formats to RDF including CSV, XML and relational databases. It also discusses federated SPARQL queries to integrate data from multiple sources and different techniques for visualizing linked data including analyzing relationships, events, and multidimensional data.
The document discusses the need for a W3C community group on RDF stream processing. It notes there is currently heterogeneity in RDF stream models, query languages, implementations, and operational semantics. The speaker proposes creating a W3C community group to better understand these differences, requirements, and potentially develop recommendations. The group's mission would be to define common models for producing, transmitting, and continuously querying RDF streams. The presentation provides examples of use cases and outlines a template for describing them to collect more cases to understand requirements.
This document discusses Timbuctoo, an application designed for academic research that allows for complex and heterogeneous data. It explores archiving RDF datasets from Timbuctoo instances, including handling RDF graphs and triples, versioning datasets, and verifying dataset integrity and resolving links. A potential pipeline is proposed to ingest datasets from Timbuctoo into the EASY archive, but current Timbuctoo instances and datasets have obscure URIs and insufficient metadata, and the prototype pipeline lacks specifications. Archiving linked data from Timbuctoo could change the nature of preservation for archives.
TripleWave: a step towards RDF Stream Processing on the Web (Daniele Dell'Aglio)
The slides of my talk at INSIGHT Centre for Data Analytics (in NUI Galway) where I presented TripleWave (http://streamreasoning.github.io/TripleWave/), an open-source framework to create and publish streams of RDF data.
Video of the presentation can be seen here: https://www.youtube.com/watch?v=uxuLRiNoDio
The Data Source API in Spark is a convenient feature that enables developers to write libraries to connect to data stored in various sources with Spark. Equipped with the Data Source API, users can load/save data from/to different data formats and systems with minimal setup and configuration. In this talk, we introduce the Data Source API and the unified load/save functions built on top of it. Then, we show examples to demonstrate how to build a data source library.
Talk about Exploring the Semantic Web, and particularly Linked Data, and the Rhizomer approach. Presented August 14th 2012 at the SRI AIC Seminar Series, Menlo Park, CA
Open data is a crucial prerequisite for inventing and disseminating the innovative practices needed for agricultural development. To be usable, data must not just be open in principle—i.e., covered by licenses that allow re-use. Data must also be published in a technical form that allows it to be integrated into a wide range of applications. The webinar will be of interest to any institution seeking ways to publish and curate data in the Linked Data cloud.
This webinar describes the technical solutions adopted by a widely diverse global network of agricultural research institutes for publishing research results. The talk focuses on AGRIS, a central and widely-used resource linking agricultural datasets for easy consumption, and AgriDrupal, an adaptation of the popular, open-source content management system Drupal optimized for producing and consuming linked datasets.
Agricultural research institutes in developing countries share many of the constraints faced by libraries and other documentation centers, and not just in developing countries: institutions are expected to expose their information on the Web in a re-usable form with shoestring budgets and with technical staff working in local languages and continually lured by higher-paying work in the private sector. Technical solutions must be easy to adopt and freely available.
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Applications (Landon Robinson)
At SpotX, we have built and maintained a portfolio of Spark Streaming applications -- all of which process records in the millions per minute. From pure data ingestion, to ETL, to real-time reporting, to live customer-facing products and features, continuous applications are in our DNA. Come along with us as we outline our journey from square one to present in the world of Spark Streaming. We'll detail what we've learned about efficient processing and monitoring, reliability and stability, and long term support of a streaming app. Come learn from our mistakes, and leave with some handy settings and designs you can implement in your own streaming apps.
Presented by Landon Robinson and Jack Chapa
This document summarizes Jean-Paul Calbimonte's presentation on connecting stream reasoners on the web. It discusses representing data streams as RDF and using RDF stream processing systems. Key points include:
- RDF streams can be represented as sequences of timestamped RDF graphs (a code sketch follows this list).
- The W3C RSP community group is working to standardize RDF stream models and query languages.
- Producing RDF streams involves mapping live data sources to RDF and adding timestamps.
- Consuming RDF streams involves discovering stream metadata and endpoints to access the streams.
- Systems like TripleWave demonstrate approaches for spreading RDF streams on the web.
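A minimal rdflib sketch of the timestamped-graph representation from the first bullet (the namespace, predicate, and events are invented):

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")

def event(subject, obj, ts):
    # One stream element: a small RDF graph plus its application time.
    g = Graph()
    g.add((EX[subject], EX.isWith, EX[obj]))
    return g, ts

stream = [event("alice", "bob", 1), event("alice", "carl", 3)]
for g, ts in stream:
    print(ts, list(g))
```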
This is part 2 of the ISWC 2009 tutorial on the GoodRelations ontology and RDFa for e-commerce on the Web of Linked Data.
See also
http://www.ebusiness-unibw.org/wiki/Web_of_Data_for_E-Commerce_Tutorial_ISWC2009
Headaches and Breakthroughs in Building Continuous Applications (Databricks)
At SpotX, we have built and maintained a portfolio of Spark Streaming applications -- all of which process records in the millions per minute. From pure data ingestion, to ETL, to real-time reporting, to live customer-facing products and features, continuous applications are in our DNA. Come along with us as we outline our journey from square one to present in the world of Spark Streaming. We'll detail what we've learned about efficient processing and monitoring, reliability and stability, and long term support of a streaming app. Come learn from our mistakes, and leave with some handy settings and designs you can implement in your own streaming apps.
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud (Ontotext)
This webinar will break the roadblocks that prevent many from reaping the benefits of heavyweight Semantic Technology in small scale projects. We will show you how to build Semantic Search & Analytics proof of concepts by using managed services in the Cloud.
RDF Linked Data - Automatic Exchange of BIM Containers (Safe Software)
This presentation tells the story, and the FME solutions, of a Dutch utility company's automatic exchange of data containers holding RDF Linked Data, BIM, and documents.
The presentation will focus on the non-traditional representation of RDF Linked Data and how this integrates with FME through SPARQL, Apache Jena, and a few customer-built transformers in FME.
This FME solution also uses my Excel switch-based method of directing the data flow (my presentation during the FME World Fair).
Similar to RDF Stream Processing Models (RSP2014) (20)
This document discusses methods for distributed stream consistency checking against a conceptual model. It presents the problem of ensuring streaming data complies with an ontology model while dealing with noise and large volumes. Two methods - NTM and LN - are proposed and evaluated. The LN method models the negative inclusion axioms in the ontology as a pipeline of bolts, reducing the load on individual bolts compared to NTM and improving performance by up to 300%. Future work is discussed around more expressive languages, inconsistency repair, and implementation on other stream processing engines.
The presentation I gave at Linköping University about web stream processing. I discuss two problems: (i) exchanging data streams on the web, and (ii) combining streams and contextual quasi-static data on the web
The talk I gave at the Stream Reasoning workshop at TU Berlin on December 8. I give an overview of RSEP-QL and how it can capture and formalise the behaviour of existing RSP engines, e.g. C-SPARQL, EP-SPARQL, CQELS, SPARQLstream.
A brief report on the contents of the Stream Reasoning workshop at ISWC 2016. Additional info about the event is available at: http://streamreasoning.org/events/sr2016
On Unified Stream Reasoning - The RDF Stream Processing realm (Daniele Dell'Aglio)
The presentation of my talk at WU Vienna on 18/2/2016. I discuss the problem of unifying existing solutions to process semantic streams - with a particular focus on the ones that perform continuous query answering over RDF streams
XSPARQL is a query language that allows querying of both XML and RDF data sources simultaneously. It extends the syntax of XQuery with a SPARQL-for clause to query RDF data and a CONSTRUCT clause to produce RDF output. XSPARQL 1.1 supports SPARQL 1.1 operators like aggregation, federation, negation and property paths. It also allows processing of JSON files. The XSPARQL evaluator takes an XSPARQL query, rewrites it, optimizes it, and executes it using XQuery and SPARQL engines to retrieve and combine data from different sources into a unified XML or RDF answer.
Augmented Participation to Live Events through Social Network Content Enrichment (Daniele Dell'Aglio)
The document describes ECSTASYS, a system that captures social media content related to live events and enriches it to provide more context and value for event attendees. ECSTASYS retrieves tweets about an event, filters irrelevant ones, identifies event-related entities, associates tweets with specific event sub-topics, and visualizes the information organized by event. It uses a knowledge base derived from event schedules and ontologies to link tweets to the correct event components to provide a more holistic view of the complex live event through social media.
This document discusses an empirical study of RDF stream processing systems. The study aimed to understand why different systems can produce different outputs for the same inputs. Through experiments, the study found that differences could be explained by parameters like the starting time (t0) of windows in continuous queries. A more detailed model called SECRET was then developed to describe stream processing and help predict system outputs. This led to the CSR-bench benchmark for evaluating and comparing RDF stream reasoning systems.
This document presents a survey of temporal extensions of description logics (DLs) conducted by Daniele Dell'Aglio, Fariz Darari and Davide Lanti. It begins with an overview and outline of the topics that will be covered, including a running example to model how to become a doctor. The paper then surveys existing solutions for extending DLs with temporal aspects, including state-change based DLs, temporal DLs with an internal approach, point-based temporal DLs and interval-based temporal DLs. It concludes with a discussion of current hot topics and future directions for research on temporal extensions of DLs.
IMaRS - Incremental Materialization for RDF Streams (SR4LD2013) (Daniele Dell'Aglio)
This document discusses incremental materialization for RDF streams (IMaRS). IMaRS is an approach for incremental reasoning over sliding windows of RDF streams. It avoids recomputing the entire materialization when the window slides by tracking expiration times and computing only the changes (additions and removals) needed for the new materialization. The maintenance is done through execution of a logic program that uses contexts to build the delta sets for updating the materialization incrementally as new data enters the window.
Ontology based top-k query answering over massive, heterogeneous, and dynamic data (Daniele Dell'Aglio)
This document discusses ontology-based top-k continuous query answering over streaming data from multiple heterogeneous sources. It aims to investigate how ontologies and top-k queries can improve continuous query processing by exploiting ordering. The research will analyze state of the art solutions, define an evaluation framework, and assess the effects on correctness and performance of techniques that integrate stream reasoning and top-k queries. Preliminary results include an extension of an RDF stream processor testbench and a case study on real-time social media analytics.
This document discusses correctness in benchmarking RDF stream processors. It proposes a common model for the operational semantics of these systems called CSR and an extension to an existing benchmark called CSR-bench that focuses on correctness. CSR-bench includes an oracle to automatically validate correctness and a test suite. Experiments with three systems showed incorrect behaviors related to window initialization, slide parameters, window contents and timestamps. The work aims to improve understanding and assessment of these systems through a shared test environment.
Maven is a build automation tool that favors convention over configuration. It utilizes a project object model (POM) file that defines project coordinates, dependencies, plugins, and repositories. Maven projects follow a standard directory structure and use lifecycles made up of phases to execute goals like compiling, testing, packaging, and deploying. It retrieves dependencies and plugins from repositories, caching artifacts locally for reuse.
The document discusses revision control systems and their main concepts and operations. It describes how revision control allows for backup of files, sharing of work, and cooperative development. The key operations covered are checkout, commit, update, and revert. It also discusses branches, tags, and distributed version control systems.
The document discusses unit testing and the JUnit framework. It defines unit testing as testing individual units or modules of code in isolation to determine if they work as expected. JUnit is introduced as a unit testing framework for Java. Key concepts covered include test cases, test fixtures, test suites, annotations for setup and teardown like @Before and @After, and best practices for test-driven development. Examples are provided of writing test cases using JUnit to test a TreeNode class and its methods.
The document discusses logging frameworks and describes key concepts such as declaration and naming of loggers, different logging levels, appenders for output, and layouts. It uses Log4j as an example logging framework and shows how to configure loggers, levels, and appenders in the properties file. Code examples are provided to illustrate logger declaration and usage.
Project Management Semester Long Project - Acuity (jpupo2018)
Acuity is an innovative learning app designed to transform the way you engage with knowledge. Powered by AI technology, Acuity takes complex topics and distills them into concise, interactive summaries that are easy to read & understand. Whether you're exploring the depths of quantum mechanics or seeking insight into historical events, Acuity provides the key information you need without the burden of lengthy texts.
Taking AI to the Next Level in Manufacturing.pdf (ssuserfac0301)
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
Driving Business Innovation: Latest Generative AI Advancements & Success Story (Safe Software)
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Introduction of Cybersecurity with OSS at Code Europe 2024 (Hiroshi SHIBATA)
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Monitoring and Managing Anomaly Detection on OpenShift.pdf (Tosin Akinosho)
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features available on those devices, but many of those features provide convenience and capability at the expense of security. This best practices guide outlines steps users can take to better protect personal devices and information.
Skybuffer SAM4U tool for SAP license adoption (Tatiana Kojar)
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
Unlock the Future of Search with MongoDB Atlas: Vector Search Unleashed.pdf (Malak Abu Hammad)
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
Main news related to the CCS TSI 2023 (2023/1695) (Jakub Marek)
An English 🇬🇧 translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Fueling AI with Great Data with Airbyte Webinar (Zilliz)
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
5th LF Energy Power Grid Model Meet-up Slides (DanBrown980551)
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Microsoft Teams session or in person at TU/e located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
-Insightful presentations covering two practical applications of the Power Grid Model.
-An update on the latest advancements in Power Grid -Model technology during the first and second quarters of 2024.
-An interactive brainstorming session to discuss and propose new feature requests.
-An opportunity to connect with fellow Power Grid Model enthusiasts and users.
Building Production Ready Search Pipelines with Spark and Milvus (Zilliz)
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
RDF Stream Processing Models (RSP2014)
1. Tutorial on RDF
Stream Processing
M. Balduini, J-P Calbimonte, O. Corcho,
D. Dell'Aglio, E. Della Valle
http://streamreasoning.org/rsp2014
RDF Stream models
Continuous query models
Daniele Dell’Aglio
daniele.dellaglio@polimi.it
2. http://streamreasoning.org/rsp2014
Share, Remix, Reuse — Legally
This work is licensed under the Creative Commons
Attribution 3.0 Unported License.
You are free:
to Share — to copy, distribute and transmit the work
to Remix — to adapt the work
Under the following conditions
Attribution — You must attribute the work by inserting
– “[source http://streamreasoning.org/rsp2014]” at the end of
each reused slide
– a credits slide stating
- These slides are partially based on “Tutorial on RDF stream
processing 2014” by M. Balduini, J-P Calbimonte, O. Corcho, D.
Dell'Aglio and E. Della Valle http://streamreasoning.org/rsp2014
To view a copy of this license, visit
http://creativecommons.org/licenses/by/3.0/
3. http://streamreasoning.org/rsp2014
Outline
1. Continuous RDF model extensions
• RDF Streams, timestamps
2. Continuous extensions of SPARQL
• Continuous evaluation
• Additional operators
3. Overview of existing systems
• Features
• Comparison
4. http://streamreasoning.org/rsp2014
Outline
1. Continuous RDF model extensions
• RDF Streams, timestamps
2. Continuous extensions of SPARQL
• Continuous evaluation
• Additional operators
3. Overview of existing systems
• Features
• Comparison
5. http://streamreasoning.org/rsp2014
Continuous extensions of RDF
As you know, “RDF is a standard model for data interchange on the
Web” (http://www.w3.org/RDF/)
<sub1 pred1 obj1>
<sub2 pred2 obj2>
We want to extend RDF to model data streams
A data stream is an (infinite) ordered sequence of data items
A data item is a self-consumable informative unit
7. http://streamreasoning.org/rsp2014
Data items and time
Do we need to associate time with data items?
• It depends on what we want to achieve (see next!)
If yes, how should time be taken into account?
• Time should not (but could) be part of the schema
• Time should not be accessible through the query language
• Modelling time as an object would require a lot of reification
How should the RDF model be extended to take time into account?
8. http://streamreasoning.org/rsp2014
Application time
A timestamp is a temporal identifier associated with a data item
The application time is a set of one or more timestamps
associated with the data item
Two data items can have the same application time
• Contemporaneity
Who assigns the application time to an event?
• The one that generates the data stream!
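For example, using the timestamped-triple notation adopted later in this tutorial, two independent events can carry the same application time (the values are illustrative):
<:alice :isWith :bob> : [3]
<:carl :isWith :diana> : [3]
Both items are stamped with time 3 by the stream producer, so they are contemporaneous.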
13. http://streamreasoning.org/rsp2014
Missing application time
An RDF stream without timestamps is an ordered sequence of
data items
The order can be exploited to perform queries
• Does Alice meet Bob before Carl?
• Who does Carl meet first?
[Figure: stream S as an ordered sequence of elements: e1 = <:alice :isWith :bob>, e2 = <:alice :isWith :carl>, e3 = <:bob :isWith :diana>, e4 = <:diana :isWith :carl>]
18. http://streamreasoning.org/rsp2014
Application time: point-based extension
One timestamp: the time instant at which the data item
occurs
We can start to compose queries taking into account the time
• How many people has Alice met in the last 5m?
• Does Diana meet Bob and then Carl within 5m?
[Figure: stream S with point-based application times on the axis 1–9: e1 = <:alice :isWith :bob> at 1, e2 = <:alice :isWith :carl> at 3, e3 = <:bob :isWith :diana> at 6, e4 = <:diana :isWith :carl> at 9]
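As an illustrative sketch, the first question above can be expressed in C-SPARQL-like syntax (the query name, prefix and stream IRI are hypothetical; C-SPARQL itself is presented later in this tutorial):
REGISTER QUERY AliceMeetings AS
PREFIX : <http://example.org/>
SELECT (COUNT(?person) AS ?met)
FROM STREAM <http://example.org/social> [RANGE 5m STEP 1m]
WHERE { :alice :isWith ?person }
Every minute, the query counts the meeting events involving :alice observed in the last 5 minutes.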
23. http://streamreasoning.org/rsp2014
Application time: interval-based extension
Two timestamps: the time range over which the data item is
valid, (from, to]
It is possible to write even more complex constraints:
• Which meetings last less than 5m?
• Which meetings have conflicts?
[Figure: stream S with interval-based application times on the axis 1–9: each element e1–e4 (<:alice :isWith :bob>, <:alice :isWith :carl>, <:bob :isWith :diana>, <:diana :isWith :carl>) is valid over its own time interval]
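Concretely, with the (from, to] convention an element could look like this (illustrative values):
<:alice :isWith :bob> : (1, 4]
i.e. the meeting holds from just after time 1 until time 4. A second element <:alice :isWith :carl> : (3, 6] would overlap it, which is exactly the kind of conflict the second question asks about.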
24. http://streamreasoning.org/rsp2014
Outline
1. Continuous RDF model extensions
• RDF Streams, timestamps
2. Continuous extensions of SPARQL
• Continuous evaluation
• Additional operators
3. Overview of existing systems
• Features
• Comparison
25. http://streamreasoning.org/rsp2014
Continuous query evaluation
From SPARQL
• One query, one answer
• The query is sent after the data is available
To a continuous query language
• One query, multiple answers
• The query is registered in the query engine
• The registration usually happens before the data arrives
• Real-time responsiveness is usually required
26. http://streamreasoning.org/rsp2014
Let’s process the RDF streams!
In the literature there are two main approaches to processing
streams
Data Stream Management Systems (DSMSs)
• Roots in DBMS research
• Aggregations and filters
Complex Event Processors (CEPs)
• Roots in Discrete Event Simulation
• Search for relevant patterns in the stream
• Non-equi-joins on the timestamps (after, before, etc.)
Current systems implement features of both
• EPL (e.g. Esper, Oracle CEP)
Now we focus on the CQL/STREAM model
• Developed in DSMS research
• C-SPARQL (and others) is inspired by this model
27. http://streamreasoning.org/rsp2014
Our assumptions
In the following we consider this setting:
• An RDF triple is an event
• Application time: point-based
<:alice :isWith :bob> : [1]
<:alice :isWith :carl> : [3]
<:bob :isWith :diana> : [6]
...
[Figure: stream S with elements e1–e4 at times 1, 3, 6, 9, as in the previous examples]
30. http://streamreasoning.org/rsp2014
Querying data streams – The CQL model
[Figure (built incrementally over slides 30–34): the CQL model. A stream is an infinite, unbounded sequence of timestamped elements <s, τ>; a relation R(t) is a finite bag of elements <s1>, <s2>, <s3> at each time instant t. Three families of mappings connect them: stream-to-relation operators (sliding windows), relation-to-relation operators (relational algebra), and relation-to-stream operators (stream operators).]
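A minimal CQL-style sketch combining the three mappings (the relation name Meetings and its attributes are hypothetical): the window is the stream-to-relation step, the WHERE clause is relation-to-relation, and ISTREAM is the relation-to-stream step:
SELECT ISTREAM person
FROM Meetings [RANGE 5 MINUTES]
WHERE companion = 'alice'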
64. http://streamreasoning.org/rsp2014
The query output
What is the format of the answer?
We can distinguish two cases:
1. No R2S operator: the output is a relation (that changes over time)
2. R2S operator: a stream
– An RDF stream? It depends on the query form
[Figure: the query pipeline. RDF streams enter through S2R operators; SPARQL operators evaluate over the resulting RDF and produce mappings; R2S operators turn these back into RDF streams]
67. http://streamreasoning.org/rsp2014
R2S operator: stream
Three operators:
Rstream: streams out all the data in the last step
Istream: streams out the data in the last step that was not in the previous step,
i.e. streams out what is new
Dstream: streams out the data in the previous step that is not in the last step, i.e.
streams out what is old
CONSTRUCT RSTREAM {?a :prop ?b }
FROM ….
WHERE ….
[Figure: the registered query runs in an RSP engine and emits a stream of timestamped triples: <… :prop … > [t1], <… :prop … > [t1], <… :prop … > [t3], <… :prop … > [t5], <… :prop … > [t7], …]
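A worked micro-example (the window contents are illustrative): suppose the window contains {a, b} at step t and {b, c} at step t+1. Then:
Rstream(t+1) = {b, c}  (everything in the last step)
Istream(t+1) = {c}  (new with respect to step t)
Dstream(t+1) = {a}  (present at step t, absent at step t+1)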
68. http://streamreasoning.org/rsp2014
Brief overview of the CEP operators
Sequence operators in the CEP world
SEQ: joins e[ti,tf] and e'[ti',tf'] if e' occurs after e
EQUALS: joins e[ti,tf] and e'[ti',tf'] if they occur simultaneously
OPTIONALSEQ, OPTIONALEQUALS: optional join variants
[Figure: stream S on the time axis 1–9; one pair of elements occurring one after the other illustrates Sequence, another pair occurring at the same instant illustrates Simultaneous]
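As a sketch of how such a constraint can look in an RSP language, EP-SPARQL (see the references) allows SEQ between graph patterns; a minimal, purely indicative example of "someone meets Bob and then meets Carl":
SELECT ?x
WHERE { ?x :isWith :bob } SEQ { ?x :isWith :carl }
(The exact surface syntax is EP-SPARQL-specific; this is only a sketch.)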
69. http://streamreasoning.org/rsp2014
Outline
1. Continuous RDF model extensions
• RDF Streams, timestamps
2. Continuous extensions of SPARQL
• Continuous evaluation
• Additional operators
3. Overview of existing systems
• Features
• Comparison
84. http://streamreasoning.org/rsp2014
The correctness problem (1)
Where are Alice and Bob, when they are together?
Let's consider a tumbling window W(ω=β=5)
Let's execute the experiment 4 times on C-SPARQL
[Figure (built incrementally over slides 84–88): stream S with elements S1–S4 at times 1, 3, 6, 9: <:alice :isIn :hall>, <:bob :isIn :hall>, <:alice :isIn :kitchen>, <:bob :isIn :kitchen>]
Execution  1st answer  2nd answer
1          :hall [6]   :kitchen [11]
2          :hall [5]   :kitchen [10]
3          :hall [6]   :kitchen [11]
4          - [7]       - [12]
Which is the correct answer?
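One plausible reading, anticipating the SECRET parameters introduced below (a sketch assuming half-open tumbling windows [t0 + i·5, t0 + (i+1)·5) that report at window close):
• t0 = 0: [0, 5) contains both :hall observations (times 1 and 3) and [5, 10) both :kitchen observations (6 and 9), giving :hall [5] and :kitchen [10] (execution 2)
• t0 = 1: [1, 6) and [6, 11) have the same contents but close one time unit later, giving :hall [6] and :kitchen [11] (executions 1 and 3)
• t0 = 2: [2, 7) sees Bob in the :hall but Alice already in the :kitchen, and [7, 12) sees only Bob in the :kitchen, so both answers are empty: - [7] and - [12] (execution 4)
All four executions are consistent with the query; they differ only in parameters the model leaves unspecified.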
103. http://streamreasoning.org/rsp2014
S2R operator
The window operator (through SECRET)
[Figure (built incrementally over slides 103–104): a sliding window W(ω, β) of width ω and slide β moving over a stream S of elements S1–S12 on the time axis t]
t0: When does the window start? (internal window parameter)
TICK: When are data stream elements added to the window? Triple-based vs graph-based
REPORT: When is the window content made available to the R2R operator? Non-empty content, Content-change, Window-close, Periodic
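Compactly (a sketch, assuming half-open intervals): the i-th window of W(ω, β) started at t0 has scope [t0 + i·β, t0 + i·β + ω), so an element with application time τ belongs to window i iff t0 + i·β ≤ τ < t0 + i·β + ω. The t0, TICK and REPORT parameters then determine when, and with which content, each window is actually exposed to the query.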
105. http://streamreasoning.org/rsp2014
Understanding the RSPs
They share similar models, but they behave in different ways
The C-SPARQL, CQELS and SPARQLstream models do not
determine in a unique way what the answer should be, given
the inputs and the query
• There are missing parameters (encoded in the implementations)
Why is it important to understand these behaviours?
• To assess whether the systems are implemented correctly
• To improve the comprehension of benchmarking results
The W3C RDF Stream Processing community group started to jointly
work towards a recommendation in 2014
http://www.w3.org/community/rsp/
106. http://streamreasoning.org/rsp2014
References
DSMSs and CEPs
• Arasu, A., Babu, S., Widom, J.: The CQL continuous query language : semantic foundations. The
VLDB Journal 15(2) (2006) 121–142
• Gianpaolo Cugola, Alessandro Margara: Processing flows of information: From data stream to
complex event processing. ACM Comput. Surv. 44(3): 15 (2012)
• Botan, I., Derakhshan, R., Dindar, N., Haas, L., Miller, R.J., Tatbul, N.: Secret: A model for analysis
of the execution semantics of stream processing systems. PVLDB 3(1) (2010) 232–243
RDF Stream Processors
• Barbieri, D.F., Braga, D., Ceri, S., Della Valle, E., Grossniklaus, M.: C-SPARQL: A continuous query
language for RDF data streams. IJSC 4(1) (2010) 3–25
• Calbimonte, J.P., Jeung, H., Corcho, O., Aberer, K.: Enabling Query Technologies for the Semantic
Sensor Web. IJSWIS 8(1) (2012) 43–63
• Le-Phuoc, D., Dao-Tran, M., Xavier Parreira, J., Hauswirth, M.: A native and adaptive approach for
unified processing of linked streams and linked data. In: ISWC. (2011) 370–388
• Anicic, D., Fodor, P., Rudolph, S., Stojanovic, N.: EP-SPARQL: a unified language for event
processing and stream reasoning. In: WWW. (2011) 635–644
Benchmarks and RSP comparison
• Ying Zhang, Minh-Duc Pham, Óscar Corcho, Jean-Paul Calbimonte: SRBench: A Streaming
RDF/SPARQL Benchmark. International Semantic Web Conference (1) 2012: 641-657
• Danh Le Phuoc, Minh Dao-Tran, Minh-Duc Pham, Peter A. Boncz, Thomas Eiter, Michael Fink: Linked
Stream Data Processing Engines: Facts and Figures. International Semantic Web Conference (2)
2012: 300-312
• Daniele Dell'Aglio, Jean-Paul Calbimonte, Marco Balduini, Óscar Corcho, Emanuele Della Valle: On
Correctness in RDF Stream Processor Benchmarking. International Semantic Web Conference (2)
2013: 326-342
107. Tutorial on RDF
Stream Processing
M. Balduini, J-P Calbimonte, O. Corcho,
D. Dell'Aglio, E. Della Valle
http://streamreasoning.org/rsp2014
RDF Stream models
Continuous query models
Daniele Dell’Aglio
daniele.dellaglio@polimi.it