Also check out the earlier presentations about Vinge's semantic technology:
http://www.slideshare.net/JrgenKerstna/can-spsrqlexplore-query-with-vinge-tutorial
http://www.slideshare.net/JrgenKerstna/failing-fast-with-explorequery
Spam is a broad term, used mostly in connection with email and blog comments, but everyone grasps what it is.
Generally, spam is “undesired electronic content”. But the matter is of course not absolute, as it might be undesired
for some but not for most - or was it vice versa? In our everyday lives we trust spam filters, which generally do the
job well, but from time to time we still face situations where messages we would have wanted to receive are taken
out by the filter. Then we customize the filter and teach it, and we accept that spam is not always spam and vice
versa. It is individual and diverse, and does not necessarily always come from ill-meaning or unethical individuals.
So what has spam to do with Linked Data publishing? Let's look at an example.
In a Linked Data browser, when looking at the describe page of somebody or something, the number of items may
be enormous and obscure the view. As an example, look at the view of Lionel Messi in DBpedia. I would like to
focus on the inferred and materialized semantics about the “Thing”, i.e. its classifications, which are asserted as
rdf:type properties.
Everyone obviously looks at this list differently, and it depends on why you are on this page. But some concepts
really stick out as not very useful knowledge. Concepts like Whole, Winner, Citizen, Medalist, YagoLegalActor,
Contestant and Player look vague and carry no really useful semantics. On the other hand, concepts like
PeopleFromRosarioSantaFe and 2007CopaAmericaPlayer look very specific and are hardly of common interest to a wide
group of linked data consumers. Then there is a third category, like BasketballPlayer, which makes you doubt whether
the tagging is correct. I would be surprised to see Messi in the same list as Kobe Bryant!
Another type of semantic spam, in the form of redundancy, is parallel ontologies. For example, you find SoccerPlayer
and Athlete type assertions from both the dbpedia and umbel ontologies. Both foaf and schema.org define Person.
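To make the redundancy concrete, here is a minimal sketch that groups rdf:type values by local name and reports the concepts asserted from more than one namespace. The list of URIs is a hand-picked illustration, not the actual DBpedia output for Messi, and the helper names `split_uri` and `parallel_concepts` are my own assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_uri(uri: str) -> Tuple[str, str]:
    """Split a URI into (namespace, local name) at the last '/' or '#'."""
    cut = max(uri.rfind("/"), uri.rfind("#")) + 1
    return uri[:cut], uri[cut:]


def parallel_concepts(type_uris: Iterable[str]) -> Dict[str, List[str]]:
    """Return local names that are asserted from more than one namespace."""
    namespaces_by_name = defaultdict(set)
    for uri in type_uris:
        namespace, local_name = split_uri(uri)
        namespaces_by_name[local_name].add(namespace)
    return {
        name: sorted(namespaces)
        for name, namespaces in namespaces_by_name.items()
        if len(namespaces) > 1
    }


# Illustrative sample only (assumption): a few type URIs of the kind
# one might see on a describe page, not a real query result.
SAMPLE_TYPES = [
    "http://dbpedia.org/ontology/SoccerPlayer",
    "http://umbel.org/umbel/rc/SoccerPlayer",
    "http://dbpedia.org/ontology/Athlete",
    "http://umbel.org/umbel/rc/Athlete",
    "http://xmlns.com/foaf/0.1/Person",
    "http://schema.org/Person",
    "http://dbpedia.org/class/yago/Winner",  # simplified, illustrative URI
]

duplicates = parallel_concepts(SAMPLE_TYPES)
for name in sorted(duplicates):
    print(name, "is defined in", duplicates[name])
```

Matching on the local name alone is a crude heuristic: two ontologies can use the same name for different concepts, so this only hints at where to look.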
I appreciate that there is a reason why these concepts are defined and asserted, and for some people under some
circumstances they may be relevant and useful. However, my purpose in visiting this describe page is to find a
sample soccer player which I can use to understand the ontology describing soccer players in general, and then to
construct a SPARQL query that fetches data for some analysis or aggregated information about them.
So for me these assertions are semantic spam. They obscure my view of the relevant concepts and properties, and
browsing through the long list wastes my time.
How would I like to deal with it? I would like to apply my spam filter.
Firstly, I would like those additional assertions to come from a separate named graph, and I would like to have
the choice to see only triples from the named graphs I am interested in. In the case of DBpedia, they are all mixed
in <dbpedia.org>. I think that, if kept in a separate context, this type of assertion originating from some specific
need can be very useful. I have used this mechanism of typing things of local interest for query performance
optimization and for efficiently cutting a semantic slice out of a huge graph of billions of facts. But these
assertions are then kept in a “sandbox like” named graph, private to a person or a team.
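The named graph idea can be sketched in a few lines, assuming every statement carries the name of the graph it came from (a quad instead of a triple). The graph names below are hypothetical, since DBpedia keeps all of this in <dbpedia.org>, and the type URIs are simplified for illustration.

```python
from typing import Iterable, List, NamedTuple


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: str  # name of the graph the statement belongs to


RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
MESSI = "http://dbpedia.org/resource/Lionel_Messi"

# Hypothetical graph names (assumption): one per source of assertions.
CORE_GRAPH = "http://example.org/graph/dbpedia-ontology"
YAGO_GRAPH = "http://example.org/graph/yago-types"
SANDBOX_GRAPH = "http://example.org/graph/my-team-sandbox"

QUADS = [
    Quad(MESSI, RDF_TYPE, "http://dbpedia.org/ontology/SoccerPlayer", CORE_GRAPH),
    Quad(MESSI, RDF_TYPE, "http://dbpedia.org/class/yago/Winner", YAGO_GRAPH),
    Quad(MESSI, RDF_TYPE, "http://dbpedia.org/class/yago/Contestant", YAGO_GRAPH),
    # A locally asserted type, private to a team (hypothetical URI).
    Quad(MESSI, RDF_TYPE, "http://example.org/team/AnalysisSample", SANDBOX_GRAPH),
]


def filter_by_graph(quads: Iterable[Quad], wanted_graphs: Iterable[str]) -> List[Quad]:
    """Keep only the statements that belong to one of the chosen graphs."""
    wanted = set(wanted_graphs)
    return [quad for quad in quads if quad.graph in wanted]


# My "spam filter": the core ontology plus my own sandbox, nothing else.
for quad in filter_by_graph(QUADS, [CORE_GRAPH, SANDBOX_GRAPH]):
    print(quad.obj)
```

In a real triple store the same selection would normally be expressed in the query itself, for example by restricting the SPARQL dataset to the chosen graphs, rather than by filtering in application code.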
Secondly, if every author and publisher of semantic concepts is given the freedom to put their stuff in a common
bowl, then there is another possibility: filtering based on namespace.
Explore&Query provides support for both filter types.
Below is how the latter, namespace-based method looks. It is a feature based on namespace prefix registration. In
the system configuration, the user can add prefixes for the namespaces they consider relevant. Anything without
a defined prefix is displayed as a full URI. On a describe page, alongside the Navigation Map actions, there is a
button “NS Control On/Off”.
Here you can toggle the namespace filter “on” and “off”.
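The behaviour described above can be approximated with a small prefix registry. This is only a sketch of the idea, not Explore&Query's actual implementation: the function names are mine, and I am assuming that with the filter “on” items from unregistered namespaces are hidden, while with the filter “off” they are shown as full URIs.

```python
from typing import Dict, Iterable, List, Optional

# Prefixes the user has registered as relevant (example configuration).
PREFIXES: Dict[str, str] = {
    "dbo": "http://dbpedia.org/ontology/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def shorten(uri: str, prefixes: Dict[str, str]) -> Optional[str]:
    """Return 'prefix:local' if a registered namespace matches, else None."""
    best_prefix, best_namespace = None, ""
    for prefix, namespace in prefixes.items():
        # Prefer the longest matching namespace if several match.
        if uri.startswith(namespace) and len(namespace) > len(best_namespace):
            best_prefix, best_namespace = prefix, namespace
    if best_prefix is None:
        return None
    return f"{best_prefix}:{uri[len(best_namespace):]}"


def display_types(uris: Iterable[str], prefixes: Dict[str, str], ns_control_on: bool) -> List[str]:
    """Build the list shown on the describe page.

    Filter on  (assumed behaviour): only registered namespaces are shown.
    Filter off: everything is shown, unregistered items as full URIs.
    """
    shown = []
    for uri in uris:
        short = shorten(uri, prefixes)
        if short is not None:
            shown.append(short)
        elif not ns_control_on:
            shown.append(uri)
    return shown


# Illustrative type URIs (assumption), not a real describe-page result.
TYPES = [
    "http://dbpedia.org/ontology/SoccerPlayer",
    "http://dbpedia.org/class/yago/Winner",
    "http://xmlns.com/foaf/0.1/Person",
    "http://umbel.org/umbel/rc/SoccerPlayer",
]

print(display_types(TYPES, PREFIXES, ns_control_on=True))
print(display_types(TYPES, PREFIXES, ns_control_on=False))
```

Registering a prefix then doubles as a statement of interest: what I have bothered to name is what I want to see.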
The example I used is the DBpedia and Yago classifications, both of which are immensely impressive and significant
linked data sources. I used them because they are available and make a really good example for imagining what might
happen if authors in the Semantic Web have the freedom to annotate facts and assert classifications, meta facts
or anything else circumstantial linked to the raw original facts.
The mechanisms for semantic spam control exist in the form of named graphs, graph-level access control and
namespaces. But the rules need to be defined and applied, because what is spam to some might be valuable
knowledge to others.