Semantic Web Investigation within Big Data Context
Murad Daryousse
Damascus University-FITE
Abstract
Data is everywhere; nearly everything can be represented
by a number. In its simplest form, data is pure: a
collection of measured information that, when analyzed
and processed, tells a story backed by numerical truth. On
the other hand, the challenges associated with the five V's
(volume, variety, velocity, veracity, and value) of this
data need to be addressed when we process, analyze, and
ultimately derive insight from it. Data characterized by
these five V's is called "Big Data", and in this research
we discuss how the Semantic Web, as a platform, can be
utilized to address the challenges associated with each of
the Big Data characteristics. We organize our work as a
state of the art of research in this context.
Keywords: Big Data, Semantic Web, Linked Data, state
of the art.
1 Introduction
Recently, Big Data has made its appearance in the shared
mindset of researchers, practitioners, and funding
agencies, driven by the awareness that concerted efforts
are needed to address 21st century data collection,
analysis, management, ownership, and privacy issues.
While there is no generally agreed understanding of what
exactly is (or more importantly, what is not) Big Data, an
increasing number of V’s has been used to characterize
different dimensions and challenges of Big Data: volume,
velocity, variety, value, and veracity. Interestingly,
different (scientific) disciplines highlight certain
dimensions and neglect others. For instance,
supercomputing seems to be mostly interested in the
volume dimension while researchers working on sensor
webs and the internet of things seem to push on the
velocity front. The social sciences and humanities, in
contrast, are more interested in value and veracity. The
variety dimension seems to be the most intriguing one
for the Semantic Web and the one where we can
contribute most as a research community (Hitzler, et al.,
2013).
In the end, all V's have to be addressed in an
interdisciplinary effort to substantially advance on the Big
Data front. The 4th Paradigm of Science is yet another
notion that has emerged within the last years and can be
understood as the scientific view on how Big Data
changes the very fabric of science. With the omnipresence
and availability of data from different times, locations,
perspectives, topics, cultures, resolutions, qualities, and
so forth, exploration becomes an additional (4th)
paradigm of science. This raises synthesis to a new level.
In other words, we can gain new insights by creatively
combining what is already there – an idea that seems to
align very well with Linked Data and Semantic Web
technologies as drivers of integration (Hitzler, et al.,
2013).
2 Characteristics of Big Data
We discuss the primary characteristics of the Big Data
problem as they pertain to the five V's.
2.1 Volume
The volume dimension of Big Data relates to the size of
data from one or more data sources, measured in tera-,
peta-, or exabytes (Anjomshoaa, et al., 2014). The sheer volume of data
being stored today is exploding. Of course, a lot of the
data that’s being created today isn’t analyzed at all
(Eaton, et al., 2012). Some estimates project that this
number will reach 35 zettabytes (ZB) by 2020. Twitter
alone generates more than 7 terabytes (TB) of data every
day, Facebook 10 TB, and some enterprises generate
terabytes of data every hour of every day of the year
(Eaton, et al., 2012). We are going to stop right there with
the factoids: Truth is, these estimates will be out of date
by the time you read this paper. However, availability of
fine-grained raw data is not sufficient unless we can
analyze, summarize, or abstract it in meaningful ways
that are actionable (Thirunarayan, et al., 2014).
However, we still need to investigate how to effectively
translate large amounts of raw data into a few human
comprehensible nuggets of information necessary for
decision-making. Furthermore, privacy and locality
considerations require moving computations closer to the
data source, leading to powerful applications on
resource-constrained devices. In the latter situation, even
though the amount of data is not large by normal
standards, the resource constraints negate the use of
conventional data formats and algorithms, and instead
necessitate the development of novel encoding, indexing,
and reasoning techniques (Thirunarayan, et al., 2014). In
summary, the volume of data to be processed on available
resources creates the following challenges: (1) Ability to
abstract the data in a form that summarizes the situation
and is actionable, that is, semantic scalability (Sheth,
2011; Sheth, 2013) to transcend from fine-grained
machine-accessible data to coarse-grained human
comprehensible and actionable abstractions; and (2)
Ability to scale computations to take advantage of
distributed processing infrastructure and to reason
efficiently on mobile devices where appropriate.
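As an illustration of the first challenge, the following minimal sketch shows semantic scalability in miniature: a stream of fine-grained, machine-accessible readings is abstracted into a single coarse-grained, human-actionable symbol. The thresholds and labels are hypothetical, chosen only for this example.

```python
# Minimal sketch of "semantic scalability": abstracting fine-grained
# sensor readings into a coarse, human-actionable label.
# Thresholds and labels here are illustrative assumptions only.

def abstract_heart_rate(samples_bpm):
    """Summarize a stream of raw heart-rate samples into one actionable term."""
    avg = sum(samples_bpm) / len(samples_bpm)
    if avg < 60:
        return "bradycardia"    # below the assumed resting range
    if avg > 100:
        return "tachycardia"    # above the assumed resting range
    return "normal"

# Many raw numbers collapse into a single decision-relevant symbol.
readings = [72, 75, 71, 74, 73, 76]
print(abstract_heart_rate(readings))  # -> normal
```

The point of the sketch is only the shape of the transformation: volume is addressed not by storing less, but by reporting at a level of abstraction that is meaningful for decision-making.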
2.2 Variety
Data today exists in various formats: text, images,
video, audio, relational data, and so on. Quite simply,
variety represents all types of data (Eaton, et al., 2012).
With the explosion of sensors and smart devices, as well
as social collaboration technologies, data has become
more complex, because it includes not only traditional
relational data, but also raw, semi-structured, and
unstructured data from web pages, web log files
(including clickstream data), search indexes, social
media, e-mail, documents, sensor data from active and
passive systems, and so on. This implies a fundamental
shift in analysis requirements: from traditional structured
data toward including raw, semi-structured, and
unstructured data as part of the decision-making and
insight process. Traditional analytic platforms cannot
handle this variety because they are designed to handle
only traditional structured (mostly relational) data. The
truth of the matter is that 80% of the world's data is
unstructured or semi-structured at best (Eaton, et al.,
2012). The value of Big Data can be realized only when
we are able to draw insights from the various kinds of
data available to us, both traditional and nontraditional.
On the other hand, the available knowledge that can be
drawn from data has a mix of declarative and statistical
flavors, capturing both qualitative and quantitative
aspects that, when integrated, can provide complementary
and corroborative information (Sheth, et al., 2012). In summary, the variety
in data formats and the nature of available knowledge
creates the following challenges: (1) Ability to integrate
and interoperate with heterogeneous data (to bridge
syntactic diversity, local vocabularies and models, and
multimodality); and (2) Semantic scalability
(Thirunarayan, et al., 2014).
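As an illustration of the integration challenge, the sketch below maps two heterogeneous sources (a relational row and an unstructured social-media post) into one common <subject, predicate, object> model, so that both become queryable through a shared subject. All URIs and field names are hypothetical, chosen only for this example.

```python
# Sketch: bridging two heterogeneous sources by mapping both into one
# triple-based model. The vocabulary URIs are illustrative assumptions.

EX = "http://example.org/"

def row_to_triples(row):
    """Map a structured patient row onto <subject, predicate, object> triples."""
    s = EX + "patient/" + str(row["id"])
    return [(s, EX + "name", row["name"]),
            (s, EX + "age", row["age"])]

def post_to_triples(post):
    """Map an unstructured post onto triples sharing the same subject URI."""
    s = EX + "patient/" + str(post["author_id"])
    return [(s, EX + "mentions", post["text"])]

graph = set()
graph.update(row_to_triples({"id": 7, "name": "Alice", "age": 63}))
graph.update(post_to_triples({"author_id": 7, "text": "feeling dizzy today"}))

# Both sources are now queryable through one shared subject.
about_alice = [t for t in graph if t[0] == EX + "patient/7"]
print(len(about_alice))  # -> 3
```

In practice one would use an RDF library and a shared ontology rather than bare tuples, but the principle is the same: a common graph model absorbs syntactic diversity at the representation layer.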
2.3 Velocity
The conventional understanding of velocity typically
considers how quickly the data is arriving and stored, and
its associated rates of retrieval (Eaton, et al., 2012). This
definition, however, merely describes one of the causes of
the data volumes we are looking at, which would make
velocity just another facet of the volume characteristic.
We believe the idea of velocity, in the Big Data context,
is actually far more compelling than this conventional
definition. We agree that today's enterprises are dealing
with petabytes of data instead of terabytes, and that the
increase in sensors and other information streams has led
to a constant flow of data at a pace that traditional
systems cannot handle.
Sometimes, getting an edge over your competition can
mean identifying a trend, problem, or opportunity only
seconds, or even microseconds, before someone else
(Eaton, et al., 2012). In addition, more and more of the
data being produced today has a very short shelf life, so
we must be able to analyze this data in near real-time if
we hope to find insights in it. After all, velocity does not
refer only to the speed of generating and storing data, but
also to the time required to exploit it. The
importance lies in the speed of the feedback loop
(Dumbill, 2012), taking data from input through to
decision. To accommodate velocity, a new way of
thinking about a problem must start at the inception point
of the data. This requires online algorithms to efficiently
crawl and filter relevant data sources, detect and track
events, and anomalies, and collect and update relevant
background knowledge (Thirunarayan, et al., 2014).
Another key challenge is the rapid, on-demand creation
of a relevant domain model or domain ontology, to be
useful for semantic searching, browsing, and analysis of
real-time content. In summary, the rapid change in data
and trends creates the following challenges: (1) Ability to
focus on and rank the relevant data; (2) Ability to process
data quickly (such as incrementally) and respond; and (3)
Ability to cull, evolve, and hone in on relevant
background knowledge.
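The first two challenges can be sketched with a minimal online filter: instead of storing the stream and analyzing it later, each item is tested on arrival against a small, evolving set of relevant background terms. The terms and messages are illustrative assumptions only.

```python
# Sketch of an online filter over a fast stream: keep only the items
# matching a small, evolving set of relevant background terms, rather
# than storing everything first. Terms and stream are illustrative.

relevant_terms = {"flood", "evacuation"}

def filter_stream(stream, terms):
    """Yield only the messages that mention at least one relevant term."""
    for msg in stream:
        if terms & set(msg.lower().split()):
            yield msg

stream = ["traffic is fine", "flood warning downtown", "new cafe opened",
          "evacuation route blocked"]
print(list(filter_stream(stream, relevant_terms)))
# -> ['flood warning downtown', 'evacuation route blocked']
```

Because the filter is a generator, it processes data incrementally as it arrives, and the term set can be culled or extended between items, which is the essence of honing in on relevant background knowledge at stream speed.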
2.4 Veracity
Generally, Big Data is characterized according to the
previous three V's (volume, variety, and velocity), but we
think that Big Data can be better explained and
characterized by adding a few more V's. These V's
capture important aspects of Big Data and of Big Data
strategy that we cannot ignore. One of these V's is
veracity: having a lot of data in different volumes coming
in at high speed is worthless if that data is incorrect or
incomplete. Incorrect data can cause a lot of problems for
organizations as well as for consumers. Therefore,
veracity refers to the degree to which we can be sure
about the correctness and trustworthiness of data coming
from many different heterogeneous sources. Statistical
methods can be applied in the context of homogeneous
data, while semantic models are necessary for
heterogeneous data (Thirunarayan, et al., 2014). In
summary, determination of veracity of data creates the
following challenges: (1) Ability to detect anomalies and
inconsistencies in data that can be due to defective
sensors or anomalous situations; and (2) Ability to reason
about and with trustworthiness that exploits temporal
history, collective evidence, context, and conflict
resolution strategies for decision-making.
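The first challenge can be illustrated with a simple corroboration check: a reading is suspect when it disagrees strongly with the collective evidence of its peers. The threshold and sensor values below are illustrative assumptions, not recommended settings.

```python
# Sketch of corroboration-based veracity checking: flag a sensor whose
# reading is far from the collective evidence (the median of all peers).
# The threshold is an illustrative assumption, not a recommended value.

def suspect_readings(readings, threshold=10.0):
    """Return the sensors whose values deviate strongly from the median."""
    ordered = sorted(readings.values())
    median = ordered[len(ordered) // 2]
    return {sensor for sensor, value in readings.items()
            if abs(value - median) > threshold}

temps = {"s1": 21.5, "s2": 22.0, "s3": 21.8, "s4": 85.0}  # s4 likely defective
print(suspect_readings(temps))  # -> {'s4'}
```

A flagged reading is not automatically erroneous; as the text notes, distinguishing a defective sensor from a genuinely anomalous situation requires additional context, history, and domain knowledge.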
2.5 Value
Another additional V that characterizes Big Data is
value: the ultimate goal of Big Data is to derive value
from it. Of course, data in itself is not valuable at all. The
value lies in the analyses done on that data, and in how
the data is turned into information and eventually into
knowledge. The value lies in how we use that data and
turn our organization into an information-centric
company that relies on insights derived from data
analysis for its decision-making. A key challenge to
get this value is the acquisition, identification (e.g.,
relevant knowledge on Linked Open Data (LOD)),
construction and application of relevant background
knowledge needed for data analytics and prediction
(Thirunarayan, et al., 2014). This does not mean ignoring
statistical techniques as part of the value-extraction
process; in fact, semantic and statistical approaches are
complementary and have mutual benefits. For example,
we can use statistical techniques and declarative
knowledge in a hybrid approach in many situations
(Perera, et al., 2013): statistical techniques can fill gaps
in existing declarative knowledge, while, conversely,
declarative knowledge can be used for error detection and
correction, and for compensating for incomplete data. In summary,
extracting value using data analytics creates the
following challenges: (1) Ability to acquire and apply
knowledge from data and integrate it with domain
ontology; and (2) Ability to learn and apply domain
models from novel data streams for classification,
prediction, decision-making, and personalization.
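The hybrid idea can be sketched in a few lines: a statistical step fills a gap in the data, and a declarative domain constraint then vets the result. The constraint bounds and values below are illustrative assumptions only.

```python
# Sketch of the hybrid statistical + declarative approach: a statistical
# estimate fills a missing value, and declarative domain knowledge then
# checks it. Bounds and values are illustrative assumptions.

def fill_missing(values):
    """Statistical step: impute missing entries (None) with the observed mean."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def violates_constraint(value, low=30.0, high=45.0):
    """Declarative step: body temperature must lie in [low, high] degrees C."""
    return not (low <= value <= high)

temps = [36.6, None, 37.1, 36.9]
filled = fill_missing(temps)
errors = [v for v in filled if violates_constraint(v)]
print(filled[1], errors)  # the imputed value, and any constraint violations
```

Each side compensates for the other: statistics handles the incompleteness that the declarative model cannot, and the declarative model catches imputed values that statistics alone would accept.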
3 Role of the semantic web in the
creation of value
As mentioned previously, the ultimate goal of Big Data
is to create value by processing and analysing this data.
We have noticed a strong relationship between the
challenges organized around the Big Data five V's and
the need to deal with knowledge and semantics in
addressing these challenges. The question here is how
these five V's and their related challenges are reflected in
the process of value creation, and therefore in the data
analysis process, and how we can apply Semantic Web
concepts and technologies to overcome these challenges
and obtain the desired value in the end. To answer this
question, we have to recognize that there is currently a
wide gap between the potential of Big Data analysis and
its realization. Below we explain the phases of the
pipeline that can create value from data, trying to address
the related challenges with the mentality of the Semantic
Web.
3.1 Big Data analysis pipeline
Before we can get value from data, there are a number
of distinct phases that data should pass through, as shown
in Figure 1 below. Each phase has its own challenges, and
there are also challenges common to all of these phases.
Heterogeneity, scale, timeliness, complexity, and privacy
problems with Big Data impede progress at all phases of
the pipeline that can create value from data (Agrawal, et
al., 2012).
Figure 1: The Big Data analysis pipeline.
The problems start right away during data acquisition,
when the data tsunami requires us to make decisions,
currently in an ad hoc manner, about what data to keep
and what to discard, and how to store what we keep
reliably with the right metadata. Much data today is not
natively in structured format; for example, tweets and
blogs are weakly structured pieces of text, while images
and video are structured for storage and display, but not
for semantic content search (Agrawal, et al., 2012).
Transforming such content into a structured format for
later analysis is a major challenge. The value of data
explodes when it can be linked with other data, thus data
integration is a major creator of value. The Semantic Web
plays a big role here: we can use its concepts and
standards, such as ontologies and Linked Data principles,
to realize this data integration and linkage. Below we
discuss this in more detail.
3.1.1 Data acquisition and recording
Big Data does not arise out of a vacuum: it is recorded
from some data generating sources (Agrawal, et al.,
2012). For example, consider our ability to sense and
observe the world around us, from the heart rate of an
elderly citizen, and presence of toxins in the air we
breathe, to the planned square kilometer array telescope,
which will produce up to 1 million terabytes of raw data
per day. Similarly, scientific experiments and simulations
can easily produce petabytes of data today. These are the
volume and velocity dimensions we mentioned
previously; however, much of this data is of no interest, and it
can be filtered and compressed by orders of magnitude.
One challenge is to define these filters in such a way that
they do not discard useful information. We need research
in the science of data reduction that can intelligently
process this raw data to a size that its users can handle
while not missing the needle in the haystack.
Furthermore, we require “on-line” analysis techniques
that can process such streaming data on the fly, since we
cannot afford to store first and reduce afterward. The
second big challenge is to automatically generate the
right metadata to describe what data is recorded and how
it is recorded and measured (Agrawal, et al., 2012). By
defining or curating domain ontologies as a conceptual
coverage for generated data, we can define such semantic
filters. Furthermore, to address volume issues, we can use
these ontologies to change the level of abstraction for
data processing to information that is meaningful to
human activity, actions, and decision-making. This is the
so-called semantic perception (Henson, et al., 2011).
Similarly, generating the right metadata for the further
analysis steps can be done by relying on these ontologies.
Besides using manually curated ontologies and reasoners
as discussed above, Linked Open Data (LOD) and
Wikipedia can be harnessed to overcome syntactic and
semantic heterogeneity with applications from social
media to the Internet of Things. On the other hand, to
address velocity we need to deal with continuous
semantics. Formal modeling of evolving, dynamic,
domains and events is hard (Thirunarayan, et al., 2014).
First, we do not have many existing ontologies to use as
a starting point. Second, diverse users will have difficulty
committing to the shared worldview, further exacerbated
by contentious topics. Building domain models for
consensus requires us to pull background knowledge
from trusted, uncontroversial sources. Here, we can
harvest the wisdom of the crowds, or collective
intelligence, to build a lightweight ontology (an informal
domain model) for use in tracking unfolding events, by
classifying, annotating and analyzing streaming data.
Therefore, more research is needed on the dynamic
creation and updating of ontologies from social-
knowledge sources such as Wikipedia and LOD that offer
exciting new capabilities in making real-time social and
sensor data more meaningful and useful for advanced
situational-awareness, analysis and decision-making.
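The idea of a semantic filter at acquisition time can be sketched as follows: a lightweight domain model (here reduced to a term-to-concept mapping) decides what to keep and attaches minimal metadata on the way in. The vocabulary is an illustrative assumption, not a real ontology.

```python
# Sketch of a semantic filter at acquisition time: a lightweight domain
# model (here, just a term -> concept mapping) decides what to keep and
# attaches minimal semantic metadata. The vocabulary is illustrative.

LIGHT_ONTOLOGY = {"smoke": "FireEvent", "flames": "FireEvent",
                  "tremor": "EarthquakeEvent"}

def annotate(message):
    """Keep a message only if it maps to a domain concept; tag it if so."""
    for term, concept in LIGHT_ONTOLOGY.items():
        if term in message.lower():
            return {"text": message, "concept": concept}
    return None  # discard: no semantic relevance to the domain model

raw = ["Smoke over the hill", "lunch was great", "small tremor felt downtown"]
kept = [a for m in raw if (a := annotate(m))]
print([a["concept"] for a in kept])  # -> ['FireEvent', 'EarthquakeEvent']
```

The filter both reduces volume (irrelevant items are dropped at the source) and generates metadata (the concept tag) that later pipeline stages can rely on, which is exactly the dual challenge this phase raises.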
3.1.2 Data analysis prerequisites
Data analysis requires information extraction, data
integration, aggregation, representation, and cleaning
before the data can be analyzed effectively. Frequently, the
information collected will not be in a format ready for
analysis. For example, consider the collection of
electronic health records in a hospital, comprising
transcribed dictations from several physicians, structured
data from sensors and measurements (possibly with some
associated uncertainty), and image data such as x-rays
(Agrawal, et al., 2012); this is what we call data variety.
We cannot leave the data in this form and still
effectively analyze it. Rather we require an information
extraction process that pulls out the required information
from the underlying sources and expresses it in a
structured form suitable for analysis. Doing this correctly
and completely is a continuing technical challenge. Note
that this data also includes images and will in the future
include video; such extraction is often highly application
dependent (e.g., what you want to pull out of an MRI is
very different from what you would pull out of a picture
of the stars, or a surveillance photo). Furthermore, we are
used to thinking of Big Data as always telling us the truth,
but this is actually far from reality. Existing work on data
cleaning assumes well-recognized constraints on valid
data or well-understood error models; for many emerging
Big Data domains these do not exist, and this is what we
call data veracity. Given the heterogeneity of the flood
of data, it is not enough merely to record it and throw it
into a repository (Agrawal, et al., 2012). With adequate
metadata, there is some hope, but even so, challenges will
remain due to differences in information details and in
data record structure. Data analysis is considerably more
challenging than simply locating, identifying,
understanding, and citing data. This requires differences
in data structure and semantics to be expressed in forms
that are computer understandable, and then “robotically”
resolvable (Agrawal, et al., 2012). If we have high-quality
semantic metadata from the data acquisition and
recording phase, then we suggest investigating Linked
Data principles for data representation. In other words,
we can use the RDF formalism to represent, integrate,
interoperate with, structure, and link data as a graph of
<subject, predicate, object> triples. This formalism can
help address the variety issues of Big Data by
representing the data in a highly structured, machine-readable
format. We can do so by employing a domain ontology
as background knowledge, in addition to benefiting
from Linked Open Data (e.g., DBpedia) for linking,
integrating, and disambiguating our triples. The
remaining cleaning issues can be addressed by gleaning
trustworthiness; this may require exploiting robust
domain ontologies and other information, such as
context, history, correlations, and metadata, that can
distinguish between erroneous data and data caused by an
abnormal situation. On the other hand, data provenance
tracking and representation can be the basis for gleaning
trustworthiness (Manuel, et al., 2010). Unfortunately,
there is neither a universal notion of trust that is
applicable to all domains nor a clear explication of its
semantics or computation in many situations. The Holy
Grail of trust research is to develop expressive trust
frameworks that have both declarative-axiomatic and
computational specification, and to devise
methodologies for instantiating them for practical use, by
justifying automatic trust inference in terms of
application-oriented semantics of trust (Anantharam, et
al., 2013).
3.1.3 Query processing, data modeling, and
analysis
We have to think about querying and mining Big Data
with a different mentality, where traditional query
languages (e.g., SQL, SPARQL, and even NoSQL
approaches) and statistical analysis methods are not
enough to realize the desired value. Big Data, as we mentioned, is often noisy,
dynamic, heterogeneous, inter-related and untrustworthy.
Further, interconnected Big Data (in linked data style)
forms large heterogeneous information networks, with
which information redundancy can be explored to
compensate for missing data, to crosscheck conflicting
cases, to validate trustworthy relationships, to disclose
inherent clusters, and to uncover hidden relationships and
models. Actually, data mining and analysis form a cyclic
process, where mining requires integrated, cleaned,
trustworthy, and efficiently accessible data, declarative
query and mining interfaces, scalable mining algorithms,
and Big Data computing environments. At the same time,
data mining itself can also be used to help improve the
quality and trustworthiness of the data, to understand its
semantics, and to provide intelligent query functions. The
value of Big Data analysis can only be realized if it is
applied under these difficult conditions (Agrawal, et al.,
2012). On the flip side, knowledge developed from data
can help in correcting errors and removing ambiguity.
Big Data is also enabling the next generation of
interactive data analysis with real-time answers, where
scaling complex query processing techniques to terabytes
while enabling interactive response times is a major open
research problem today (Agrawal, et al., 2012). In the
context of RDF data representation, we need new
methods that enable a tight coupling between declarative
query languages and the functions of analysis and
mining, and these will benefit both expressiveness and
performance of the analysis.
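The desired tight coupling can be sketched with a toy triple-pattern matcher (where None acts as a wildcard, playing the role of a SPARQL variable) whose bindings feed directly into an aggregate analysis step. The graph data is an illustrative assumption.

```python
# Sketch of coupling a declarative query with analysis: a tiny triple
# pattern matcher (None = wildcard, like a SPARQL variable) feeds its
# bindings straight into an aggregate. The graph data is illustrative.

graph = [("s1", "type", "TempSensor"), ("s1", "reading", 21.5),
         ("s2", "type", "TempSensor"), ("s2", "reading", 22.3),
         ("s3", "type", "Camera")]

def match(pattern):
    """Return all triples matching the (s, p, o) pattern; None = wildcard."""
    return [t for t in graph
            if all(q is None or q == v for q, v in zip(pattern, t))]

# Declarative selection, then, in the same pipeline, the analysis step.
sensors = {s for s, _, _ in match((None, "type", "TempSensor"))}
readings = [o for s, _, o in match((None, "reading", None)) if s in sensors]
print(sum(readings) / len(readings))  # mean reading over the selected sensors
```

The benefit the text argues for is visible even at this scale: because selection and aggregation share one representation, the analysis function never has to re-parse or re-integrate the data it receives.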
3.1.4 Interpretation
Having the ability to analyze Big Data is of limited value
if users cannot understand the analysis (Agrawal, et al.,
2012). Ultimately, a decision-maker, provided with the
result of analysis, has to interpret these results. This
interpretation cannot happen in a vacuum. Usually, it
involves examining all the assumptions made and
retracing the analysis. Furthermore, there are many
possible sources of error: computer systems can have
bugs, models almost always have assumptions, and
results can be based on erroneous data. For all of these
reasons, no responsible user will cede authority to the
computer. Rather she will try to understand, and verify,
the results produced by the computer. The computer
system must make it easy for her to do so. This is
particularly a challenge with Big Data due to its
complexity. In short, it is rarely enough to provide just
the results. Rather, one must provide supplementary
information that explains how each result was derived,
and based upon precisely what inputs. Such
supplementary information is called the provenance of
the (result) data. By studying how best to capture, store,
and query provenance, in conjunction with techniques to
capture adequate metadata, we can create an
infrastructure to provide users with the ability both to
interpret analytical results obtained and to repeat the
analysis with different assumptions, parameters, or data
sets. With our semantic mentality, we think the best way
to provide such provenance is to rely on domain
ontologies and LOD as a proof framework. Representing
domain knowledge as an ontology using Semantic Web
standards can help us justify analysis results; in other
words, analysis results must be justifiable by the domain
knowledge used.
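The core of result provenance can be sketched very simply: every derived value carries the inputs and the operation that produced it, so a decision-maker can retrace the analysis rather than cede authority to the computer. The record structure below is an illustrative assumption, not a standard provenance format such as PROV.

```python
# Sketch of result provenance: every derived value carries the inputs
# and the operation that produced it, so the analysis can be retraced.
# The record structure is illustrative, not a standard format.

def derive(operation, inputs):
    """Compute a result and attach its provenance record."""
    result = operation(inputs)
    return {"value": result,
            "provenance": {"operation": operation.__name__,
                           "inputs": list(inputs)}}

def mean(xs):
    return sum(xs) / len(xs)

answer = derive(mean, [10.0, 12.0, 14.0])
print(answer["value"])                    # -> 12.0
print(answer["provenance"]["operation"])  # -> mean  (retraceable)
```

With such records, repeating the analysis under different assumptions, parameters, or data sets amounts to re-invoking the recorded operation on modified inputs, which is exactly the infrastructure capability the text calls for.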
4 Conclusion
We have entered an era of Big Data. Through better
analysis of the large volumes of data that are becoming
available, there is the potential for making faster
advances in many scientific disciplines and improving
the profitability and success of many enterprises.
In this paper, we have investigated how the
Semantic Web can be an enabler for addressing many
aspects of Big Data challenges like heterogeneity, lack of
structure, error-handling, timeliness, and provenance at
all stages of the analysis pipeline from data acquisition to
result interpretation. These challenges are common
across a large variety of application domains, and
therefore not cost-effective to address in the context of
one domain alone. There are, of course, many challenges,
out of the scope of this paper, that need to be addressed
before the potential of Big Data can be fully realized.
Semantic Web concepts and technologies can play a
major role as a mediation layer between Big Data as it is
and its ultimate transformation into big value. Finally,
we must support and encourage fundamental research
towards addressing these challenges in different
mentalities if we are to achieve the promised benefits of
Big Data.
References
Agrawal, D., et al. (2012). Challenges and Opportunities
with Big Data: A white paper prepared for the
Computing Community Consortium. USA.
Anantharam, P., Thirunarayan, K., & Sheth, A. (2013).
Traffic Analytics using Probabilistic Graphical Models
Enhanced with Knowledge Bases. In Proceedings of
the 2nd International Workshop on Analytics for
Cyber-Physical Systems (ACS-2013). Ohio Center of
Excellence in Knowledge-Enabled Computing.
Anjomshoaa, A., Tjoa, A. M., & Hendrik (2014).
Towards Semantic Mashup Tools for Big Data. Bali:
Springer Berlin Heidelberg.
Dumbill, E. (2012). Big Data Now. O'Reilly Media, Inc.
Eaton, C., et al. (2012). Understanding Big Data. USA:
McGraw-Hill.
Flood, M., et al. (2011). Using Data for Systemic
Financial Risk Management. In Proc. Fifth Biennial
Conf. on Innovative Data Systems.
Henson, C., Thirunarayan, K., & Sheth, A. (2011). An
ontological approach to focusing attention and
enhancing machine perception on the Web. Applied
Ontology, 6(4).
Hitzler, P., & Janowicz, K. (2013). Linked Data, Big
Data, and the 4th Paradigm. IOS Press.
Manuel, J., & Pérez Gómez (2010). Provenance and
Trust. SlideShare. Retrieved from
http://www.slideshare.net/jmgomez23/provenance-and-
trust.
Manyika, J., et al. (2011). Big data: The next frontier for
innovation, competition, and productivity. McKinsey
Global Institute.
Perera, S., et al. (2013). Semantics Driven Approach for
Knowledge Acquisition From EMRs. IEEE Journal of
Biomedical and Health Informatics, 515-524.
Sheth, A., & Thirunarayan, K. (2012). Semantics
Empowered Web 3.0: Managing Enterprise, Social,
Sensor, and Cloud-based Data and Services for
Advanced Applications. Morgan & Claypool.
Thirunarayan, K., et al. (2014). Comparative trust
management with applications: Bayesian approaches
emphasis. ScienceDirect, 182-199.
Thirunarayan, K., & Sheth, A. (2014). Semantics-
Empowered Approaches to Big Data Processing for
Physical-Cyber-Social Applications. Dayton: AAAI.