Scei technical whitepaper-19.06.2012

SCEI (Semantic Communication Engine Innsbruck)
pronounced SKY
Technical Whitepaper
Dieter Fensel, Michael Fried, Christoph Fuchs, Iker Larizgoitia, Alex Oberhauser, Stefan Thaler, Ioan
Toma
v 0.6
19.06.2012
Abstract
1. Introduction
2. Problem definition
3. Reference architecture
3.1. Semantic layer/domain knowledge
3.2. Separation of components
3.3. Data and content storage
3.4. The weaving process in general
4. Reference implementation
4.1. Content Management System
4.1.1. Domain and task specific UI
4.1.2. Workflow engine and communication patterns
4.1.3. Export of RDF data (OWLIM Integration)
4.1.4. The Weaving Process within the CMS
4.1.4.1. Publication in CMS
4.1.4.2. Feedback collection in CMS
4.1.4.3. Statistics collection in CMS
4.2. dacodi
4.2.1. The Weaving Process within dacodi
4.2.1.1. Common Weaver Model
4.2.1.2. Publication in dacodi
4.2.1.3. Feedback Collection in dacodi
4.2.1.4. Statistics Collection in dacodi
4.2.3. Adapters

Abstract
The Semantic Communication Engine Innsbruck (SCEI) is a fully fledged online communication
software suite. It supports users in online communication, gathering feedback and measuring
online impact. The software contains workflow assistance as well as communication patterns
supporting the planning and execution of online campaigns. Furthermore, we look into
integrating crowdsourcing support, for example for the translation of texts into
foreign languages. In particular, we enable fast and easy one-click publishing and collection
of content on a multitude of marketing channels, hiding technology complexity behind a user-
friendly interface, and directly reflecting on the impact within online communities and web
presence. The core idea of our approach is to introduce a semantic layer on top of the various
Internet based communication channels that is domain specific (e.g. tourism, hotels, marketing,
agencies, etc.) and not channel specific. This document describes the overall technical
architecture of the multi-channel management and communication software. Furthermore, we
present the motivation, some use cases and architectural diagrams that outline the implementation
details.
Note: In this version of the document we mainly focus on the publication, feedback and
statistics core functionality of SCEI together with an overview of the semantic layer. In
later versions we will also cover workflow capabilities, communication patterns as well
as the crowdsourcing components.

1. Introduction
Today’s online world is more than ever driven by the fast-paced exchange of information.
The rise of Facebook, YouTube and others resulted in a notable shift in how companies and
individuals share and exchange information. These social media platforms and online services
enable everyone to interact with a huge, already established user base. While a “traditional”
online presence in the form of a company or personal web page is still relevant, the inherent
recommendation mechanics of social media platforms are beneficial to reach potential
customers. However, online information is not exclusively for human consumption. Using
semantic technologies, information can be enriched with metadata, making it readable for
machines as well.
This inherent difference between the traditional, social, and machine-readable ways of making
information available is essential with regard to how the information is treated.
While a traditional web page has many advantages in terms of content ownership and freedom
of presentation, there are usually limited metrics which indicate if the presented information
was appreciated by a visitor. Unless special mechanisms (such as a rating or feedback system)
are implemented, visitor numbers, as well as geographic data are the only metrics available.
Social media platforms, on the other hand, provide very simple feedback mechanisms which
are usually unobtrusive. Further, communication is encouraged by providing an easy way to
exchange messages between users. The emphasis on feedback and interaction is the main
difference in how information is treated on social media platforms as opposed to traditional
web pages. Analyzing the accumulated feedback is a useful indicator of whether the published
information was received well by the audience or not. More broadly, this enables the
steering of brand perception as well as holistic online reputation management and
customer relationship management.
Traditional web pages and social media platforms concentrate on humans as their main data
consumers. The rise of various services (such as web services or mobile applications) and
publishing methods such as linked data provide an incentive to present the data in machine
readable form as well.
Our goal is to develop a set of tools which combines the traditional, social and machine
readable way to interact with information and makes this process easier than it is with existing
tools. To reach this goal we will develop a unified layer - dacodi - which is able to interact with
social media platforms, and extend existing content management systems (starting with Drupal)
to incorporate the social and machine readable aspect into existing solutions. Additionally we
will develop support for defining workflows as well as identify communication patterns that help
in the planning, execution and controlling of online campaigns. What differentiates our solution
is the introduction of a semantic layer that abstracts information items and underlying concepts
from the concrete channels that the user wants to manage. This semantic layer is specific to
a domain (e.g. Hotels, Restaurants, Doctors, Event managers, ...) which shall enable users to
work from a conceptual view rather than a channel view. Throughout this paper those software
components are referred to as the Semantic Communication Engine Innsbruck, or SCEI.
Figure 1: SCEI conceptual overview
The aforementioned differentiation (traditional, social, machine readable) allows us to separate
responsibilities of our software components, making the whole SCEI modular, which results in
higher efficiency, robustness and scalability. Obviously, this separation is not strict and the three
variants can overlap. Certain types of web pages combine various technologies and paradigms
which do not allow a strict classification. However, this three-fold separation is not meant as a
classification of current online information. Its purpose is to define the types of information with
which the SCEI interacts.
The aim of this document is to introduce our technical solution to the problem of online multi
channel management. Before defining the general problem in detail, we are going to agree
upon certain terms to define a clear terminology. After the problem description in Section 2 we
will present the high-level approach of our solution, i.e. the reference architecture, in Section
3, followed by a more detailed and technical look into the software components, i.e. the reference
implementation, in Section 4. This will comprise the separation of the system into two big parts, the
CMS and dacodi components, the introduction of our content and channels merging approach
achieved through something we call a weaving process as well as its impacts on publication,
feedback and statistics collection. We will also introduce a Common Weaver Model, which
enables scalability of the system by exploiting the fact that similar channels have common
characteristics.
Terminology
We define the following terms in order to establish a common understanding of the topic:
● Communication1 is, according to Wikipedia, the activity of conveying information.
Communication requires a sender, a message (an object of communication, information
or a form of information), a channel (the medium) and an intended recipient. Bidirectional
communication follows a broad process model that often starts with a publication
or broadcasting activity, which can be followed by feedback, which in turn often triggers
the exchange of further information or even leads to engagement in a long
conversation.
● Dissemination2 is the act of broadcasting content to the public without direct feedback
from the audience.
● Content Management is the set of processes and technologies that support the
collection, managing, and publishing of information in any form and medium. Digital
content may take the form of text, multimedia files or any other file type which follows a
content life cycle that requires management.
● RDFa is a W3C Recommendation that allows embedding of RDF statements into
XHTML documents, HTML4 and HTML5.
● Microdata is a similar approach to RDFa. It allows semantics to be embedded into existing
HTML content. Microdata aims to be simpler than RDFa and plays a major role in search
engine optimization (SEO).
● A Channel is a means of transporting a message, i.e. a medium. In our
definition, an online channel does NOT equal a full communication platform. Potentially
every URI is a channel. For example, an HTML page within a larger website can be a
channel.
● A Platform is a collection or a group of channels. For example Facebook is not one
channel, but a collection of multiple channels e.g. the Facebook wall being one of them.
A Platform allows access to more than one communication channel, (e.g. video, text,
image).
● Pull channels are channels that actively gather data from a predefined source. A
homepage (a single HTML page) or a wiki page, for example, requests data from a server.
These data sources can be manifold (e.g. a semantic repository), but the procedure is
always the same: the system pulls information from a source. For example, a Linked
Data endpoint can also be queried using SPARQL, and the extracted information can be
transformed, reused, etc. If we have direct control over the underlying data, we can
semantically annotate it using technologies like RDFa and microdata.
● Push channels are channels to which information has to be sent explicitly. These
channels include email, bulletin boards and Web 2.0 platforms. None of these channels
actively gathers information from external data sources. This means that if we want to
distribute information to such channels, we have to actively push it to the correct one.
Also, because the user usually does not have full control over the data pool
and storage, it is not possible to control the semantic annotation of, for example, a tweet or
Facebook post.
● Information Item is the entity to be published. An information item may be semantically
enriched and thus described by an underlying concept. Viewed from the syntactic side
an information item is represented as XHTML with the possibility of RDFa annotations.
● User in our terminology is an agent (human or software solution) who executes a task
related to online communication.
● Adapters in dacodi are used to provide uniform access to all communication channels.
They are the linking part between the actual communication channel (e.g. Facebook API
for wall posts) and dacodi. We distinguish between two types of adapters: publishing and
retrieval.
1 http://en.wikipedia.org/w/index.php?title=Communication&oldid=480484048
2 http://en.wikipedia.org/w/index.php?title=Dissemination&oldid=458980901
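The two adapter types from the terminology above can be illustrated with a minimal Python sketch. All class and method names here (PublishingAdapter, RetrievalAdapter, InMemoryWallAdapter, publish, fetch_feedback) are hypothetical and not taken from the dacodi code base; the in-memory channel merely stands in for a real platform API.

```python
from abc import ABC, abstractmethod


class PublishingAdapter(ABC):
    """Hypothetical interface: pushes content to one channel."""

    @abstractmethod
    def publish(self, text: str) -> str:
        """Publish text and return a channel-specific post id."""


class RetrievalAdapter(ABC):
    """Hypothetical interface: collects feedback from one channel."""

    @abstractmethod
    def fetch_feedback(self, post_id: str) -> list[str]:
        """Return the feedback (e.g. comments) attached to a post."""


class InMemoryWallAdapter(PublishingAdapter, RetrievalAdapter):
    """Stand-in for a real channel such as a social media wall."""

    def __init__(self) -> None:
        self._posts: dict[str, str] = {}
        self._comments: dict[str, list[str]] = {}

    def publish(self, text: str) -> str:
        post_id = f"post-{len(self._posts) + 1}"
        self._posts[post_id] = text
        self._comments[post_id] = []
        return post_id

    def add_comment(self, post_id: str, comment: str) -> None:
        # Simulates a visitor commenting on the channel itself.
        self._comments[post_id].append(comment)

    def fetch_feedback(self, post_id: str) -> list[str]:
        return list(self._comments[post_id])


wall = InMemoryWallAdapter()
post_id = wall.publish("Winter offer: 3 nights for the price of 2")
wall.add_comment(post_id, "Sounds great!")
print(post_id, wall.fetch_feedback(post_id))
```

Code that works against the two abstract interfaces does not need to know which concrete channel is behind an adapter, which is the uniform access the definition refers to.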
2. Problem definition
After introducing the topic we now focus on defining the resulting problems that have to be
overcome if one wants to reach scalable and efficient online communication. On a high level,
the general problem is the following: A user has content that he wants to make accessible to
others. This content can either be published as static content on a traditional web page, as
a “status update” or something comparable on a social media platform, or as RDF triples in a
triple store or the Linked Open Data cloud. If the user desires to utilize multiple outlets - to reach
a wider and more diverse audience for example - the content has to be published in multiple
places. However, publication in multiple places currently results in duplicate effort and manual
labor.
While the problem may be very similar conceptually, publication on different channels works
differently when looking at the technical details. These are not mere technicalities or minor
differences, however. When it comes to, for example, ownership of the published data, the
differences between a traditional web page and a social media platform are major. This shift of
ownership/responsibility implies further differences with regard to what operations are possible
on published data, e.g. modification and deletion of already published content.
A homepage is in most cases intended to be world readable without restrictions, whereas
social networks can be quite restrictive and make content only consumable for registered
customers. Additionally a homepage should be structured well in order to enable the user to
quickly discover the information needed. In most social networks recent content is automatically
delivered to a user in his stream.
The traditional Web and especially the Web 2.0 (Social Web) are becoming an inseparable
part of identity creation and represent a key medium for companies to communicate with
existing and potential customers. However, the opportunities for companies to leverage
Web technologies for attracting more site visitors and reaching more target-group users are
accompanied by a number of challenges. These include, as stated before, technical difficulties,
but more importantly, handling the growing number and diversity of social platforms, specialized
news web pages, blogs, discussion forums and messaging services. We address these
hindrances by providing innovative marketing communication and impact-measurement
solutions. In particular, we offer the first product that employs semantics for creating a level of
abstraction over all communication channels, thus supporting the recommendation of suitable
channels and simultaneous publishing of content. We base the tool development
on four main approaches for handling complexity and reducing the amount of manual effort:
● Description of communication channels’ capabilities which is implicitly given through
clustering of channels into groups with similar functionality
● Semantic representation of the customer’s domain information. SCEI makes use of
semantic annotations, which can refer to a domain ontology. These annotations are
useful for other services, as well as for the publication component of the tool (i.e. dacodi)
and play an important role in search engine optimisation.
● Channel recommendation
● Content transformation to fit a particular channel
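The first and third approaches in the list above can be sketched together: channels are clustered into groups with similar functionality, and a recommendation is derived from the capabilities of each group. This is only an illustrative Python sketch; the group names, the media capabilities and the function recommend_channels are assumptions, since the actual clustering is not specified in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelGroup:
    """A cluster of channels with similar functionality (hypothetical)."""
    name: str
    media: frozenset[str]       # kinds of content the group can carry
    channels: tuple[str, ...]   # concrete channels in the group


# Illustrative groups only; the real clustering is not given in this document.
GROUPS = [
    ChannelGroup("microblog", frozenset({"text"}), ("Twitter",)),
    ChannelGroup("video", frozenset({"video", "text"}), ("YouTube", "Vimeo")),
    ChannelGroup("social network", frozenset({"text", "image", "video"}), ("Facebook",)),
]


def recommend_channels(required_media: set[str]) -> list[str]:
    """Return channels whose group supports every required media type."""
    return [
        channel
        for group in GROUPS
        if required_media <= group.media
        for channel in group.channels
    ]


print(recommend_channels({"video"}))
```

Because capabilities are attached to the group rather than to each channel, adding a new channel to an existing group requires no new capability description.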
Content distribution and feedback monitoring in various channels is a manual and labor
intensive task. Take, for example, video upload. A user has to upload a video on potentially
multiple platforms (YouTube, Vimeo, Facebook video), copy and paste the video title/description
and enter tags manually. After the upload process, the user may want to notify his clients via
social networks about the new video. Thus, the video link has to be copied and posted as a
status update. Further, a short description alongside the link would be beneficial, which
again has to be written or copied. Our tool aims to eliminate any manual labor that can be
automated in this and similar processes.
The resulting software product saves time and hides technology specifics behind an easy-to-use
interface, enabling a flexible and scalable multi-channel communication strategy. Furthermore,
the tool also uses different metrics to statistically capture and analyze the online reach and
impact, providing means for evaluating the online marketing strategy as well as for conducting
reputation management by reacting in a timely manner, especially to negative posts and feedback.
3. Reference architecture
In order to solve the problems mentioned above, this section explains our solution on a
conceptual level.
The central element of our approach is the separation of content and online channels. This
allows reusing the same content for various communication means. Through this reuse we
want to achieve scalability of multi-channel communication. The explicit modeling of content
independent from specific channels also adds a second element of reuse: Similar operational
entities active in the same domain can reuse significant parts of such a content model.
Separating content from channels also requires the explicit alignment of both. This is achieved
through a weaving process.
Figure 2 shows the high-level SCEI reference architecture. The following sections give more
details about the reference architecture, its components, where the content generated by the user is
stored, and the weaving process described above.
Figure 2: SCEI reference architecture
3.1. Semantic layer/domain knowledge
In order to abstract the domain specific communication from the actual channels, thus lifting the
distribution and data collection in channels to an upper conceptual layer, we need semantics
on top of our solution. On the one hand, this layer captures data in domain specific ontologies;
on the other hand, it describes the various communication channels. In order to interweave the
domain specific concepts with the underlying communication channels we propose a weaving
process which will be explained in more detail within this document. In the end this semantic
layer will smartly decide which kind of content is distributed to which channel in which form.
Let’s take for example a hotelier who wants to build up or extend the online presence of his or
her business. First of all there is a need to know all relevant channels which reflect the target
group. This list can include things like a homepage, mailing lists, fora or social networks. After
knowing the available channels, accounts have to be created on some selected platforms.
In addition, a hotelier has to be present on various rating and review sites in order to
maximize business opportunities. These channels can be manifold and it is extremely hard to
keep an overview of what’s going on in these channels without technical assistance, i.e. a tool
that distributes to and aggregates all channels in a single interface. Moreover, technical details as
well as emerging channels shall be integrated quickly and transparently. So the end user needs
to work on a level he or she understands, i.e. a domain specific layer with concepts well known
in the industry sector, instead of handling each channel separately.

3.2. Separation of components
The STI online communication tool is split into a set of components that can be conceptually
grouped in two major parts, namely:
1. The content management system (CMS), together with the domain and task specific
interfaces and the workflow engine and communication patterns component
2. The data and content distribution component, responsible mainly for the Web 2.0
communication (i.e. data distribution to and feedback collection in push channels).
Obviously this separation is needed for satisfying the different requirements of push and pull
channels due to the contradicting nature of these two approaches and their application by
different existing channels. One must however note that both paradigms have in common
that multi-directional communication (conversations between multiple users) can occur and
often statistical information can be extracted from the channels. Also through this component
separation we guarantee maximum scalability, allow easy adaptation to multiple use cases and
simplify the integration with the seekda hotel booking solution as well as other 3rd party apps.
Another main motivation of this separation is to have a single layer which unifies social media
platforms (Web 2.0 channels) - namely the data and content distribution component. This
enables easy integration by providing a single common interface, as well as the possibility
of external use, as mentioned above. Another approach would be to integrate everything
in the CMS of your choice, thus disregarding any possibility of loose coupling, reuse and a
component-based architecture.
On each data change in the CMS another module sends the newly created/updated content to
the data and content distribution component API. This loosely coupled approach allows an easy
exchange of the CMS part and makes the data and content distribution component independent
from current content management solutions. The data and content distribution component API
makes it also possible to create use case specific interfaces for data and content distribution
component (e.g. enables white labeling) and quickly integrate it into 3rd party applications in use
cases where the “heavy weight” CMS part including things like content hosting is not needed.
The following use cases were defined to show the advantage of such a flexible architecture:
● A hotelier does not want to change his existing homepage infrastructure and CMS but
nevertheless profits from addressing multiple Web 2.0, e-mail and rating channels via
the data and content distribution component. Setup and usage of the software must be
easy in order to be performed by an averagely skilled user. Here the communication with
the customer, including engagement in conversations via the tool, is the primary focus
of the user. Such a tool can be offered with a low pricing scheme since the data and
content distribution component mainly handles content distribution and does not have to
care about site hosting and content per se.
● The dissemination partner of an EU project needs a fully fledged, out of the box solution
to address all important channels at once. The CMS with semantic data export in
combination with data and content distribution component enables this. The initial
setup is, however, a non-trivial task since the homepage structure and the links to LOD
vocabularies and other ontologies must be created. However, we can expect a more
technically skilled user to operate this full package.
● A marketing agency with a multitude of different customers faces several other
problems. Each customer wants fully fledged offline and online presence in multiple
channels. SCEI is very flexible in this regard. It is possible for them to maintain the
full Web 2.0 presence of a customer via the data and content distribution component and
if needed to also provide their customers with a state-of-the-art CMS solution.
In Figure 2 we provide a high level overview of all SCEI components and actors which are the
following:
● User: Person that operates the software and works on the level of information items
rather than channels. We distinguish between several specific user roles, namely:
○ Content creator: Person that generates the content of the items to be
disseminated.
○ Workflow designer: Person that defines communication patterns and workflows
involving communication, multi-channel publishing and social media monitoring
within an organization (e.g. a hotel business).
● Information Item: Content that traverses the system and is stored, distributed and
transformed within the process.
● CMS: The content management system (in our case Drupal 7.x) exposes the user
interface to the user as well as HTML in the form of a website, accessible via the Web.
○ Domain and Task specific UI: Depending on the application domain, user, task
and role, the user interface adapts itself and makes all relevant information
easily accessible.
○ Workflow Engine/Communication patterns: In order to support publication
and controlling workflows, SCEI contains a workflow engine. Well-known
communication patterns help in planning and executing these workflows.
○ RDFa annotation: Enriches the information item with semantic metadata in
order to export it to an RDF repository as well as to ease the distribution via the
data and content distribution component, because the tool understands the
meaning of the information item instead of just the structure.
○ Scheduling: Contains rules about delayed or recurrent publication of the
information item.
○ DB: Database which stores the actual content.
○ RDF export plugin: Exports all information items from the DB to an external
repository.
● Semantic repository: External RDF repository which exposes all information items via
a SPARQL endpoint. It also contains the domain and channel models and makes this
information accessible to both the CMS and data and content distribution component.
● Data and content distribution: Distributes content in and aggregates information from
all push channels.
○ API: Makes the data and content distribution component accessible to the CMS
as well as to other 3rd party applications, so that it can be integrated into
external software solutions. Receives HTML which can be additionally enriched
with RDFa annotations.
○ DB: Database that stores references to the information items and their representation
in the different channels.
○ Publishing Module: Is responsible for distributing the information item to
different channels.
■ Content Extractor: Analyzes the HTML coming from the API and
extracts all relevant information.
■ Concept to channel mapper: Decides which part of the information item
will be pushed to which channel.
■ Content Transformer: Transforms the content in order to fit the channels
e.g. shortens a text to 140 characters for Twitter publication.
■ Scheduling: Contains rules about delayed or recurrent publication of the
information item.
○ Statistics module: Collects and stores all valuable statistical information coming
from the various channels e.g. site visits, number of views, and such.
■ Item Analyzer: Handles statistics of an information item in various
channels.
■ Channel Analyzer: Handles statistics coming from a specific channel
regarding all information items published within.
○ Engagement Module: Is responsible for direct interaction in various channels.
■ Feedback Collector: Gathers feedback from all channels in order to
present it centrally. This can be, for example, comments or reviews.
■ Interaction Component: Enables the user to react to the gathered feedback, for
example to reply to online comments.
○ Impact Analyser Module: Determines which impact publications have had.
■ Impact Analyser: Specialized form of statistics that tries to determine
how to efficiently leave an impact in the online world. We differentiate here
between real impact, based on active publications, and potential
impact, meaning how many people the user can potentially reach in the
various channels, given a limited number of e.g. friends or subscribers.
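The Content Transformer described in the component list can be illustrated with a short Python sketch that shortens a text to a channel-specific limit without cutting a word in half. The function names, the CHANNEL_LIMITS table and the use of a trailing "..." are assumptions made for illustration; only the 140-character limit for Twitter comes from this document.

```python
# Hypothetical per-channel limits; only the Twitter value is taken from the text.
CHANNEL_LIMITS = {"Twitter": 140}


def shorten(text: str, limit: int = 140, ellipsis: str = "...") -> str:
    """Shorten text to at most `limit` characters, cutting at a word boundary."""
    if len(text) <= limit:
        return text
    cut = text[: limit - len(ellipsis)]
    # Drop a trailing partial word if the cut fell in the middle of a word.
    if " " in cut and not text[len(cut)].isspace():
        cut = cut[: cut.rfind(" ")]
    return cut.rstrip() + ellipsis


def transform_for_channel(text: str, channel: str) -> str:
    """Apply the channel limit if one is known, otherwise pass the text through."""
    limit = CHANNEL_LIMITS.get(channel)
    return text if limit is None else shorten(text, limit)


long_text = "word " * 40  # 200 characters
print(len(transform_for_channel(long_text, "Twitter")))
print(transform_for_channel("Short news item", "Twitter"))
```

A real transformer would likely also have to deal with links, media attachments and markup, which this sketch ignores.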
3.3. Data and content storage
The CMS actually stores data and content (for example pictures) in its internal database.
References, meaning links, to these data items can be found in the website’s HTML code as
well as the exported RDF triples. The data and content distribution component, on the other
hand, is not meant as a content hosting solution and therefore does not store all content and
data. The only exception where the data and content distribution component stores data like
images and videos, although temporarily, is when publishing is delayed by the scheduling
mechanism.
We distinguish between dynamic and static information publishing. With static information we
refer to a “distributed profile” in all Web 2.0 channels that can be changed at once. Such a profile
contains things like contact information or a representative picture, in short things that should
not change frequently and are valid without temporal constraints. Dynamic information comprises
things that will be pushed to e.g. news feeds and represent information at a certain point in time.
For such publications the data and content distribution component only stores a reference to
which channels content was distributed and a textual description of the content, so that it can
later be identified by the user and specific feedback can be assigned to it. Acting mainly as a
speaking tube, the data and content distribution component provides a lightweight and scalable
solution.
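What the data and content distribution component keeps for a dynamic publication can be sketched as a small record: a textual description, references to the channels the item was distributed to, and the feedback assigned to it. The Python class below is a hypothetical illustration; the field and method names are assumptions and do not describe the actual dacodi database schema.

```python
from dataclasses import dataclass, field


@dataclass
class PublicationRecord:
    """Reference data kept for one dynamic publication (names are hypothetical)."""
    description: str
    # channel name -> reference (e.g. post id or URL) of the item in that channel
    channel_refs: dict[str, str] = field(default_factory=dict)
    # channel name -> feedback messages assigned to this publication
    feedback: dict[str, list[str]] = field(default_factory=dict)

    def add_reference(self, channel: str, ref: str) -> None:
        self.channel_refs[channel] = ref

    def assign_feedback(self, channel: str, message: str) -> None:
        # Feedback can only be assigned for channels the item was published to.
        if channel not in self.channel_refs:
            raise KeyError(f"item was never published to {channel!r}")
        self.feedback.setdefault(channel, []).append(message)


record = PublicationRecord("Summer package announcement")
record.add_reference("Twitter", "tweet-123")
record.assign_feedback("Twitter", "Is breakfast included?")
print(record)
```

Note that the record holds no images, videos or full text, in line with the component not being a content hosting solution.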
3.4. The weaving process in general
As mentioned in the previous Section, the general problem is one of content distribution and
feedback collection. We define a “weaving” process to formalize the steps necessary to solve
this problem. In general, this process can be broken down as follows:
1. Content input
2. Selection of publication channels
3. Content adaptation
4. Publication
5. Collection of feedback
6. Collection of statistics
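The six steps above can be sketched as a small pipeline. The following Python example is a simplified illustration under stated assumptions: the channels are in-memory dictionaries standing in for real platforms, content adaptation is reduced to naive truncation, and the function names weave and collect are hypothetical.

```python
def weave(content: str, channels: dict, selected: list[str]) -> dict:
    """Steps 2 to 4: adapt and publish one item to the selected channels."""
    published = {}
    for name in selected:                         # 2. selected publication channels
        channel = channels[name]
        adapted = content[: channel["max_len"]]   # 3. naive content adaptation
        channel["posts"].append(adapted)          # 4. publication
        published[name] = len(channel["posts"]) - 1
    return published


def collect(channels: dict, published: dict) -> tuple[dict, dict]:
    """Steps 5 and 6: gather feedback and count it per channel."""
    feedback = {
        name: channels[name]["comments"].get(index, [])
        for name, index in published.items()
    }
    stats = {name: len(items) for name, items in feedback.items()}
    return feedback, stats


# Hypothetical in-memory channels standing in for real platforms.
channels = {
    "microblog": {"max_len": 20, "posts": [], "comments": {}},
    "blog": {"max_len": 10_000, "posts": [], "comments": {}},
}
content = "New spa area opens on 1 July"              # 1. content input
refs = weave(content, channels, ["microblog", "blog"])
channels["blog"]["comments"][refs["blog"]] = ["Looking forward to it"]
feedback, stats = collect(channels, refs)
print(channels["microblog"]["posts"][0])
print(stats)
```

Publication returns references to the published items so that feedback and statistics can later be collected per item and per channel.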
4. Reference implementation
We provide a reference implementation for the reference architecture presented in Section
3. The reference implementation is outlined in Figure 3 and is split into two major parts:
a content management system (CMS) part based on Drupal 7.x, and the in-house data and content
distribution implementation called dacodi. Additionally, an external semantic repository,
namely OWLIM, is used to save content. The rest of this section provides the technical details
on the reference implementation.

Figure 3: SCEI reference implementation
4.1. Content Management System
As basis for our CMS solution we use Drupal 7.x. The reason is its native RDF support, the
availability of additional semantic modules, such as a SPARQL endpoint or microdata export
and the possibility of third-party module development.
The publication of new or the updating of existing content (information item) starts with the
responsible person creating or changing one piece of information. This process is handled
by the underlying CMS. If necessary, scheduling information could be provided to postpone
the publication. After a successful change the content is saved to the external OWLIM
repository and sent to the dacodi API. The CMS utilizes dacodi to extend its content distribution
capabilities. Likewise the CMS acts as a specialized kind of user interface from the dacodi
viewpoint. In the following we will further outline how RDF data is exported by the CMS and
how the first part of the weaving process works. The second part of the weaving process will be
described in the dacodi Section of this document.
4.1.1. Domain and task specific UI
The domain and task specific user interfaces are the components through which
users interact directly with our system. They are sub-components of, and directly
implemented using, the CMS.
The design and look-and-feel of these components are closely adapted to the mindset
of the users, supporting them in specifying content in a terminology that is familiar to them. For
example hoteliers will specify content items that they want to be disseminated in terms of offers,
touristic packages, etc. The domain and task specific user interfaces support thus information
dissemination abstraction based on the concrete domain, independent of the channel(s) of
dissemination.
The domain and task specific user interfaces also allow the user to manage and solve task
specific activities including yield, brand and reputation management, customer relation
management and online advertising.
4.1.2. Workflow engine and communication patterns
In order to support the user we offer a workflow engine together with support for communication
patterns. This component enables users to define and manage complex workflows on top of
the underlying SCEI components for communication, multi-channel publishing and social media
monitoring. Such workflows usually have a long lifespan and involve multiple employees
working together on improving the visibility, reputation and communication of an entity.
The workflow engine and communication patterns component can be used to manage the
communication workflow including assigning, tracking and responding to user feedback. Using
this component one can define and manage steps and protocols to be activated when certain
events related to the published information occur., e.g. a bad comment on a post in Facebook
is written. Take for example a hotelier. Using the workflow engine and communication patterns
component, the hotelier can specify and manage when and which of his employes, depending
on his availability, are taking care of responding to customers posts on various channels about
his hotel, or engage with customers to present them new hotel offers.
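The hotelier example can be pictured as a small set of event rules. The following is only a
minimal sketch under our own assumptions: the names Employee, Rule and assign as well as the
event type strings are hypothetical and are not part of the SCEI workflow engine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Employee:
    name: str
    available: bool


@dataclass
class Rule:
    """Hypothetical rule: an event type and the employees who may handle it."""
    event_type: str
    responders: list


def assign(event_type: str, rules: list) -> Optional[str]:
    """Return the name of the first available responder for the event, or None."""
    for rule in rules:
        if rule.event_type != event_type:
            continue
        for employee in rule.responders:
            if employee.available:
                return employee.name
    return None


rules = [
    Rule("negative_comment", [Employee("Anna", False), Employee("Ben", True)]),
    Rule("booking_question", [Employee("Anna", False)]),
]
print(assign("negative_comment", rules))
print(assign("booking_question", rules))
```

An event for which nobody is available yields None, which a real workflow would have to
escalate or queue.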
4.1.3. Export of RDF data (OWLIM Integration)
The export of the CMS content to an external triplestore repository allows the website data
to be published as a bubble in the linked data cloud. The consistency of the two databases
is guaranteed with the help of hooks that the CMS triggers on each add, update and delete
operation. Hooks are functions that allow a plug-in to intercept the internal workflow of the
CMS. After an operation has been executed successfully, the RDF export plug-in creates triples
and uses the Sesame REST API to add or change the content in OWLIM. For semantically annotating
(RDFa and microdata) the content of the homepage exposed by the underlying CMS, we use
the Drupal internal database, since available Drupal modules already provide this annotation
functionality. The OWLIM repository mainly serves as a linked data SPARQL endpoint.
As seen in Figure 3, the changes to the CMS are not intrusive, since the added functionality is
provided by plug-ins. Two additional plug-ins need to be written: one for the OWLIM integration
and one for dacodi.
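The hook-based export can be illustrated with a short sketch. It is written in Python for
brevity, whereas the actual plug-in lives inside the CMS. The repository URL, the function
names and the text/plain content type for N-Triples are assumptions of this sketch. It only
relies on the fact that Sesame exposes the statements of a repository under
/repositories/<id>/statements.

```python
import urllib.request

# Hypothetical endpoint of the OWLIM repository behind the Sesame REST API.
STATEMENTS_URL = "http://localhost:8080/openrdf-sesame/repositories/scei/statements"


def literal(value: str) -> str:
    """Serialize a plain string as an N-Triples literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def node_to_ntriples(node_uri: str, fields: dict) -> str:
    """Turn a CMS node (predicate URI -> string value) into N-Triples lines."""
    lines = [f"<{node_uri}> <{pred}> {literal(val)} ."
             for pred, val in sorted(fields.items())]
    return "\n".join(lines) + "\n"


def on_node_insert(node_uri: str, fields: dict) -> urllib.request.Request:
    """Hook body for an 'add' operation: prepare a POST that adds the triples."""
    body = node_to_ntriples(node_uri, fields).encode("utf-8")
    return urllib.request.Request(
        STATEMENTS_URL,
        data=body,
        headers={"Content-Type": "text/plain"},  # assumed MIME type for N-Triples
        method="POST",
    )
```

The request is only built here. Passing it to urllib.request.urlopen would send it. Update
and delete hooks would follow the same pattern with the corresponding HTTP operations.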
4.1.4. The Weaving Process within the CMS
With regard to a content management system, the weaving process looks as follows:
1. Content input
The content is entered in the CMS, directly by the user of the CMS.
2. Channel selection
The user decides where the document should be published with regard to the internal document
tree of the CMS. If a distribution to social media platforms via dacodi is desired, Web 2.0
channels can be selected as well.
3. Content transformation
Content adaptation is not necessary for the CMS, since there are no content restrictions.
4. Publication
The document is published in the CMS and - if desired - as triples in the LOD cloud. The
information item is passed to dacodi during the publication phase, along with the
previously selected Web 2.0 channels.
5. Feedback collection
Direct user feedback, like comments, shares, retweets, etc., is gathered by dacodi
and can be queried by the CMS using the dacodi API (see the Section Feedback
Collection in dacodi).
6. Statistics collection
Collection of visitor numbers and demographic data can be done via a tool like Google
Analytics or the open source solution PIWIK.
The publication as triples in the LOD cloud (or to any external triplestore), as mentioned in step
4 of the weaving process, is done by a plugin which integrates OWLIM in the CMS.
4.1.4.1. Publication in CMS
The CMS component enables the publication of content on a homepage. It also provides
functionality to annotate the website’s HTML with RDF data and to export these RDF data as a
whole in order to make the content machine-understandable.
4.1.4.2. Feedback collection in CMS
Feedback in the CMS can come from various sources, for example an internal commenting or
rating system.
4.1.4.3. Statistics collection in CMS
Statistics within the CMS can come from various sources, like Google Analytics or PIWIK for
analyzing page visits, or internal comment and feedback systems.
In the following Section we explain how dacodi distributes content in multiple channels.
4.2. dacodi
The dacodi component is used to distribute information in various Web 2.0 and email
channels, to collect and analyze feedback from those channels, and to actively engage
in conversations (e.g. reply to comments). Central to dacodi is the weaving process, which
enables channel selection based on the semantics of the information item to be distributed and
content transformation based on these channels. If manual effort is necessary, for example for
entering content at a certain spot in a wiki system, the content can be sent to the responsible
webmaster via e-mail. We will describe how the weaving process works within dacodi and
how the component interacts with online channels using Adapters for publication, feedback
collection and statistics collection.
4.2.1. The Weaving Process within dacodi
The ultimate goal of the weaving process is the semi-automated publication of the information
item in fitting channels, including necessary transformations, based on the information type.
Thus, the weaving process can be broken down into the following steps:
1. Content input
In the case of dacodi, this equals the acquisition of the information item, either through
the API (coming e.g. from the CMS) or through a dedicated user interface.
2. Channel selection
Selection of appropriate Web 2.0 channels based on the information type.
3. Content transformation
Transformation of the information item into a Common Weaver Model (CWM) instance.
4. Publication
Publication of the (transformed) information item in the selected channels.
5. Feedback collection
Feedback collection via the APIs of the used channels.
6. Statistics collection
Statistics collection via the APIs of the used channels.
We discuss two of these steps, channel selection and content transformation, in the following
Subsections. Afterwards we explain in detail the Common Weaver Model, which is part of the
content transformation component and is specific to dacodi, not to the CMS.
Channel Selection
Based on the information item type (e.g. a business event), fitting channels for the information
item are selected (e.g. a business event is announced on LinkedIn but not on Facebook). The
central component of the channel selection process is the (Concept-to-Channel) Mapper, which
maps each concept to the appropriate channels.
Consequently, the Mapper takes a concept as input and returns a list of channels which
are relevant for the concept. The Mapper of the prototype implementation uses a static
mapping which maps every concept to a list of channels. Due to the modular architecture of the
application, the Mapper component can easily be replaced with a more sophisticated, dynamic
approach. It would be possible, for example, to implement a Mapper that incrementally learns
from user adjustments and thus alters the channel mappings based on the users' needs.
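A static Mapper of this kind can be sketched in a few lines. The class name StaticMapper, the
method channels_for and the concrete concept and channel names are assumptions made for
illustration. The prototype's actual mapping is not reproduced here.

```python
class StaticMapper:
    """Concept-to-Channel Mapper backed by a fixed dictionary."""

    def __init__(self, mapping):
        # Copy the lists so that later changes by the caller do not leak in.
        self._mapping = {concept: list(channels)
                         for concept, channels in mapping.items()}

    def channels_for(self, concept):
        """Return the channels relevant for a concept (empty list if unknown)."""
        return list(self._mapping.get(concept, []))


# Hypothetical mapping, following the LinkedIn example in the text.
mapper = StaticMapper({
    "BusinessEvent": ["LinkedIn", "Twitter"],
    "SpecialOffer": ["Facebook", "Twitter"],
})
print(mapper.channels_for("BusinessEvent"))
```

A learning Mapper could keep the same channels_for interface and only change how the
internal mapping is maintained.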
Transformation
For every channel in which the information item is to be published, a transformation is
necessary. For example, a business event might include fields such as short title, long title,
description, start date, end date, location and venue. Further, there might be an accompanying
image which represents the event, such as a poster. A channel which only takes short text
messages (Twitter, for example) can’t handle all those fields. Thus, the information item has
to be transformed into what we call a Common Weaver Model (CWM) instance.
To expand on the previous example, one could combine the most important information of a
business event (short title, start date, end date, location and venue) into a string
which fits the channel’s restrictions - Twitter’s 140 characters, for example. The transformer
component defines which transformations are necessary to go from an Information Item to a
Common Weaver Model instance.
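The business event example could be transformed into a short text as follows. This is a
minimal sketch: the class BusinessEvent, the function to_text, the field order and the
truncation with three dots are assumptions, not the behaviour of the dacodi transformer.

```python
from dataclasses import dataclass


@dataclass
class BusinessEvent:
    short_title: str
    start_date: str
    end_date: str
    location: str
    venue: str


def to_text(event: BusinessEvent, max_length: int = 140) -> str:
    """Combine the key fields into one string and cut it to the channel limit."""
    text = (f"{event.short_title}: {event.start_date} - {event.end_date}, "
            f"{event.venue}, {event.location}")
    if len(text) <= max_length:
        return text
    # Truncate and mark the cut with three dots, staying within the limit.
    return text[:max_length - 3].rstrip() + "..."


event = BusinessEvent("Semantic Web Workshop", "01.10.2012", "02.10.2012",
                      "Innsbruck", "Congress Hall")
print(to_text(event))
```

Other text channels would call the same function with their own max_length.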
4.2.1.1. Common Weaver Model
The Common Weaver Model [3] (CWM) exploits the fact that similar channels have common
characteristics. For example, Facebook status updates and Twitter enable the user to share
short text messages in the form of status updates. YouTube, Vimeo and Facebook video enable the
user to upload and share videos. After looking at various Web 2.0 channels, we have identified
the following Common Weaver Models:
● Text: A String of varying length. Online communication as it is today relies heavily on
the exchange of short text messages. In essence, those messages are simply Strings
of varying length. Depending on the platform, such text messages can be between 140
(Twitter) and many thousand characters (63,206 in the case of Facebook).
● Link: A common hyperlink denoted by the <link /> or <a /> HTML element.
● Image: A two-dimensional image denoted by the <img /> HTML element. While support
may differ depending on the Channel, possible Internet media types include: gif, jpeg,
png, svg, tiff.
● Video: A video file. While support may differ depending on the Channel, possible
Internet media types include MPEG-1 video with multiplexed audio, MP4 video, Ogg
Theora video, QuickTime video, WebM Matroska-based open media format, Matroska
open media format, Windows Media Video.
● Presentation: A presentation file. We want to support this type - and thus related Web
2.0 platforms like SlideShare - in a future version. Not supported in the prototype.
● Audio: An audio file. Not supported in the prototype.
During the weaving process, instances of those models are extracted from the information
item and sent to the selected publishing adapters. Each Common Weaver Model instance is
stored internally using a unique identifier and grouped by the information item to which it is
related.
The granularity of the extraction depends on the information item which is to be published.
For example, if the user simply wants to publish a single link in various channels, it makes
sense to extract the link and publish it. On the other hand, if a more complex information
item contains dozens of links, it does not make sense to extract and publish every link (this
would amount to spamming), unless the user explicitly wants to do so.
3 The model in Common Weaver Model refers to a model from a software engineering point of view, as in
MVC (Model-View-Controller). A model manages the behavior and data of the application domain.
Figure 4: Extraction of Common Weaver Model (CWM) instances from an information item.
When an information item is published, e.g. a business event, CWM instances are extracted.
Expanding on the business event example introduced in the Transformation section: If the
business event includes an image, it will be extracted and published to fitting image channels
like Flickr. The essential information about the event can be combined in a string format and
published via text channels, such as Twitter, Xing or Facebook. Since every CWM instance
knows from which information item it was extracted, a link to the original information item (in this
example: the business event) can be embedded, e.g. in the description of the image.
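The extraction step can be sketched as a function that turns an information item into a list
of CWM instances, each carrying a unique identifier and the id of the item it came from. The
class CWMInstance, the dictionary keys "summary", "image" and "links", and the rule of
extracting a link only when there is exactly one are assumptions of this sketch.

```python
import itertools
from dataclasses import dataclass, field

_ids = itertools.count(1)


@dataclass
class CWMInstance:
    kind: str       # "text", "link", "image" or "video"
    payload: str    # the text itself, or a URL / file path
    item_id: str    # information item this instance was extracted from
    instance_id: int = field(default_factory=lambda: next(_ids))


def extract(item_id: str, item: dict) -> list:
    """Extract CWM instances from an information item given as a dict.

    A link is only extracted when the item has exactly one, so that an item
    with many links does not flood the link channels.
    """
    instances = []
    if item.get("summary"):
        instances.append(CWMInstance("text", item["summary"], item_id))
    if item.get("image"):
        instances.append(CWMInstance("image", item["image"], item_id))
    links = item.get("links", [])
    if len(links) == 1:
        instances.append(CWMInstance("link", links[0], item_id))
    return instances
```

Because every instance keeps its item_id, an adapter can embed a link back to the original
information item, as described above for the image description.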
4.2.1.2. Publication in dacodi
The publisher module takes care of two things: publication of the information item (in this stage
of the weaving process represented as Common Weaver Model instances) using adapters, and
scheduling.
We plan to support scheduling in two ways: delayed publication and repeated publication. For
example: Delayed publication can be used to announce an event or a special offer at a specific
time, whereas repeated publication may be used to send reminders (e.g. for a call for papers) in
all channels.
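Both kinds of scheduling fit into one priority queue ordered by due time. The sketch below is
an illustration under our own assumptions (class Scheduler, methods schedule and due_items).
It does not describe the planned dacodi implementation.

```python
import heapq
from datetime import datetime, timedelta


class Scheduler:
    """Keeps publication jobs ordered by due time."""

    def __init__(self):
        self._queue = []   # heap of (due, sequence number, item id, interval)
        self._seq = 0      # tie-breaker so that equal due times never compare ids

    def schedule(self, item_id, due, repeat_every=None):
        """Delayed publication; pass repeat_every (a timedelta) to repeat it."""
        heapq.heappush(self._queue, (due, self._seq, item_id, repeat_every))
        self._seq += 1

    def due_items(self, now):
        """Pop every job due at `now` and re-queue the repeating ones."""
        due_now, requeue = [], []
        while self._queue and self._queue[0][0] <= now:
            due, _, item_id, interval = heapq.heappop(self._queue)
            due_now.append(item_id)
            if interval is not None:
                requeue.append((item_id, due + interval, interval))
        # Re-queue after the loop so an item is returned at most once per call.
        for item_id, due, interval in requeue:
            self.schedule(item_id, due, interval)
        return due_now
```

A publisher would call due_items periodically and hand the returned items to the adapters.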
4.2.1.3. Feedback Collection in dacodi
Every Information Item that is published by dacodi is tracked by the system, to provide statistical
information and a per-channel impact analysis. This feature allows the user to see how well the
published information item was received, without having to check every channel individually.
Feedback
Basically, there are four forms of feedback that are supported by various Web 2.0 platforms
and are thus relevant for dacodi:
● Unary feedback. Any predefined feedback that can only be positive. Examples: “like”
on Facebook, “retweet” on Twitter, “favourite” on Flickr, “favourite” on YouTube, etc.
● Binary feedback. Any predefined feedback that is either positive or negative.
Example: thumbs up/down on YouTube.
● Rating/ranking. Feedback that can be quantified on a discrete scale. Example: star
rating on a hotel review platform.
● Textual feedback. Any user-created feedback in the form of replies, comments or
any other form of written feedback. NLP techniques can be used to analyze the textual
feedback and provide the user with additional information, e.g. whether a comment/reply
was positive or negative. A user can directly react to textual feedback within dacodi if
the underlying channel allows this functionality.
4.2.1.4. Statistics Collection in dacodi
There are several statistical metrics that are relevant for the user. While the unary, binary,
rating and textual feedback are centered on the information item, statistics are relevant on a
per-item and per-channel basis. Examples:
● Number of unary, binary, rating and textual feedback entries per information item (this
includes features like “most discussed information item”, i.e. the information item with the
most textual feedback).
● Number of information items published in each channel over a certain amount of time
(day, week, month, year).
● Calculation of a combined impact metric per channel, based on feedback analysis of the
information items published in the channel.
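These metrics reduce to simple aggregations over feedback records. In the sketch below a
record is a tuple of information item id, channel and feedback kind. The function names and
in particular the weights of the combined impact metric are assumptions, since the document
does not define how the metric is calculated.

```python
from collections import Counter, defaultdict

# Each feedback record: (information item id, channel, feedback kind),
# where the kind is one of "unary", "binary", "rating" or "textual".


def feedback_per_item(records):
    """Count feedback of each kind per information item."""
    counts = defaultdict(Counter)
    for item_id, _channel, kind in records:
        counts[item_id][kind] += 1
    return counts


def most_discussed(records):
    """Return the item with the most textual feedback, or None."""
    textual = Counter(item for item, _ch, kind in records if kind == "textual")
    return textual.most_common(1)[0][0] if textual else None


def impact_per_channel(records, weights):
    """Weighted sum of feedback per channel; the weights are an assumption."""
    impact = defaultdict(float)
    for _item, channel, kind in records:
        impact[channel] += weights.get(kind, 0.0)
    return dict(impact)
```

A time-windowed variant would add a timestamp to each record and filter before aggregating.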
4.2.3. Adapters
As mentioned before, we distinguish between two types of adapters: publishing and retrieval.
The purpose of a publishing adapter is to publish an information item in a certain channel.
Retrieval adapters are used to gather information about already published information items.
Since the APIs as well as the offered functionality differ from channel to channel (e.g. Twitter’s
and Facebook’s API differ) a separate adapter for each channel needs to be written.
In our prototype we intend to create publishing and retrieval adapters for the following platforms:
YouTube, Facebook and Twitter. All three of them have a Web API and cover a majority of the
features we want to realize, such as publishing videos, texts, images and links. This is a starting
point for implementing new adapters that provide similar functionality.
We have identified the following channel-specific features that each adapter has to be able to
handle:
● Mapping CWM properties to appropriate properties in the published
communication channel. For example, the text property of a Twitter post is called ‘text’,
whereas the text property of a Facebook post is named ‘message’. Since we have to
implement an adapter for each communication channel we want to address, we will
implement this by a simple mapping routine in each adapter.
● Authentication and authorization to the communication channel. Most Web
2.0 communication channels rely on OAuth / OAuth2 [4] to realize authorization and
authentication. However, some of them rely on OpenID [5], basic HTTP authentication or
form-based authentication mechanisms to restrict user access. The adapter has to
be able to deal with these individual mechanisms and has to store and load the credentials
of each user.
● Publishing a specific CWM instance. As mentioned above, the publishing process varies
from platform to platform, thus this functionality has to be abstracted. This also holds
true for retrieving feedback from different platforms.
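The property mapping and the abstraction of the publish step can be sketched with an abstract
base class. All class and method names are hypothetical. Only the property names ‘text’ and
‘message’ are taken from the example above, and no real Web API is called.

```python
from abc import ABC, abstractmethod


class PublishingAdapter(ABC):
    """Common shape of a publishing adapter; all names here are hypothetical."""

    # Maps CWM property names to the property names of the channel.
    property_map = {}

    def __init__(self, credentials):
        self.credentials = credentials

    def to_channel_payload(self, cwm_properties):
        """Rename CWM properties to the names the channel expects."""
        return {self.property_map[name]: value
                for name, value in cwm_properties.items()
                if name in self.property_map}

    @abstractmethod
    def publish(self, cwm_properties):
        """Send the payload to the channel; left to each concrete adapter."""


class TwitterTextAdapter(PublishingAdapter):
    property_map = {"text": "text"}

    def publish(self, cwm_properties):
        # A real adapter would call the channel's Web API here.
        return self.to_channel_payload(cwm_properties)


class FacebookTextAdapter(PublishingAdapter):
    property_map = {"text": "message"}

    def publish(self, cwm_properties):
        return self.to_channel_payload(cwm_properties)
```

Properties that a channel does not know are dropped by the mapping routine, so the same CWM
instance can be handed to every adapter.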
Adapter loading and naming conventions
We designed our adapters and the adapter loading mechanism to achieve the following three
goals:
● No adapter duplication. The same functionality should be achieved with the same
code (minimal and simplest possible code base).
● A common adapter structure for all platforms. Platforms are structured differently.
However, for the clarity of dacodi we want adapters to be integrated into the system in a
uniform way. Adding the same functionality (e.g. adding an image channel) should be
achieved in a similar manner for all platforms.
● Automatic adapter loading and execution. There should be no manual effort involved
in adding a new adapter to the system, except for programming the adapter.
To achieve these three goals, we designed the system carefully and introduced naming and
loading conventions for our adapters. These are described in the following
sections.
Figure 5 depicts the motivation for our design. The illustration sketches that a social media
platform offers more than one way to publish information. Additionally, each user
account on a platform gives access to another, similar set of communication channels. For
example, someone who has two accounts on Twitter has a duplicate of every communication
channel available on Twitter, say two text channels, two image channels and so on. The only
difference between those channels is the user credentials that are used to authenticate
the post. Different platforms, in turn, allow posting similar common weaver items. Adhering
to the SRP software development principle [6], we chose to write an individual adapter for each
explicit communication channel in each platform. This - together with a file naming
convention - also allows us to load and execute adapter classes automatically, without having
to change configuration files and without any additional manual effort. If no adapter class is
found for a channel, this channel is simply not supported.
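Convention-based loading can be sketched as a lookup by name. Note the substitution: dacodi
uses a file naming convention, whereas this self-contained sketch uses a class naming
convention of the form <Platform><ChannelType>Adapter within one module. The base class
Adapter, the function load_adapter and the adapter names are assumptions. With one adapter
per file, the lookup would instead import a module whose name is built the same way.

```python
class Adapter:
    """Base class; every adapter in this sketch derives from it."""


class TwitterTextAdapter(Adapter):
    def publish(self, text):
        return f"twitter/text: {text}"


class FacebookImageAdapter(Adapter):
    def publish(self, image_path):
        return f"facebook/image: {image_path}"


def load_adapter(platform, channel_type):
    """Resolve an adapter from its conventional name, e.g. TwitterTextAdapter.

    Returns None when no adapter class with that name exists, which means
    the channel is not supported.
    """
    class_name = f"{platform.capitalize()}{channel_type.capitalize()}Adapter"
    for adapter_class in Adapter.__subclasses__():
        if adapter_class.__name__ == class_name:
            return adapter_class()
    return None


adapter = load_adapter("twitter", "text")
print(adapter.publish("hello") if adapter else "channel not supported")
print(load_adapter("twitter", "video"))
```

Adding a new channel then only requires writing a class with the conventional name. No
registry or configuration file has to be touched.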
4 http://en.wikipedia.org/wiki/Oauth
5 http://en.wikipedia.org/wiki/Openid
6 http://en.wikipedia.org/wiki/Single_responsibility_principle
Figure 5: Platform as channel groups offering multiple ways to publish information
We named the components that define a communication channel in dacodi: a channel type, a
platform and user credentials. In detail, they are:
● Channel Type: A virtual grouping of channels that allow publishing the same common
weaver items, e.g. image, video or text. This is depicted in Figure 5. It is virtual because
it is split up into many different adapters for many different platforms, but is accessed in a
uniform way nonetheless.
● Platform: A grouping of channels that share the same user credentials. An example
of a platform, or channel group, is Facebook. The notion of the channel group has been
introduced since a platform such as Twitter or Facebook actually allows access to more
than one communication channel (e.g. video, text, image).
● User credentials: The information needed to authenticate/authorize a client to
a certain platform and to associate it with a certain account. In dacodi, user credentials
contain the following information: an account id, which associates a channel group with
a user (e.g. the Facebook account id 1234 with the dacodi user 27); the authorization
token and secret, which store the information required to complete actions on a
platform, such as posting (e.g. an OAuth2_token associated with the account, or a
password); and the consumer key and consumer secret, which contain information about the
application that is about to publish (one can think of it as the authentication of dacodi,
which guarantees to the platform that it is actually dacodi that is publishing on the user's
behalf). The notion of user credentials has been introduced since a user may have multiple
accounts on one platform.
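The three components can be written down as plain data structures. The field names below
follow the description above, but the classes UserCredentials and CommunicationChannel and
the helper channels_of are assumptions of this sketch, not the dacodi data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserCredentials:
    account_id: str        # account on the platform, e.g. Facebook account 1234
    dacodi_user: int       # dacodi user the account belongs to
    token: str             # authorization token (or password)
    token_secret: str
    consumer_key: str      # identifies the publishing application
    consumer_secret: str


@dataclass(frozen=True)
class CommunicationChannel:
    channel_type: str      # "text", "image", "video", ...
    platform: str          # channel group, e.g. "Facebook"
    credentials: UserCredentials


def channels_of(user, channels):
    """List (platform, account, channel type) for every channel of a dacodi user."""
    return sorted((c.platform, c.credentials.account_id, c.channel_type)
                  for c in channels if c.credentials.dacodi_user == user)
```

Two accounts of the same user on one platform thus yield two distinct channels of the same
type, which differ only in their credentials.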