Semantic Technology in Publishing & Finance

Triplestores and inference, applications in Finance, the GraphDB engine,
text-mining, projects and solutions for financial media and publishers
Keystone Industrial Panel
ISWC 2014, Riva del Garda, 18 Oct 2014
Semantic Technology in Publishing & Finance Oct 2014 #1

Outline
• Introduction to Ontotext
• Clients, cases
• Text Mining, Media and Publishing Solution
• SemTech applications in Finance
• Wrap-up

Ontotext
• Information management company providing text analysis,
data management and state-of-the-art semantic technology
• 75 employees, headquartered in Sofia, Bulgaria
• Sales presence in London, Washington DC, and Boston
• Clients include BBC, AstraZeneca, US DoD, OUP, Wiley, Getty…
• Over 400 person-years in R&D to create a one-stop shop for:
– Content enrichment
– Data management
– Graph database engine
• Open and standards-compliant technology:
– RDF(S), OWL, GATE, Sesame

Interlinking Text and Data

Semantic Annotation
[Figure: an article (pmid:17714090, "Clinical and experimental pharmacology …", author Ian A Yang) linked to the concepts Bronchial Diseases, Respiration Disorders, COPD / Chronic Obstructive Airway Diseases, and Asthma; identifiers shown: umls:C0035204, umls:C0006261, umls:C000496]

Semantic Annotation
Semantic Annotation goes far beyond
tagging. It allows search using enrichment,
linking and rules to return explicit and
implicit results – complete intelligence.
• Content Enrichment
– Text Mining & Classification
– Curation
– Quality Monitoring
• Data Management
– Ontologies and Semantic Annotation
– Web mining
– Identity Resolution
• Graph Database
– Standards Based
– 24-7 Resiliency
– Hybrid Semantic Queries & Search

What is RDF Good for?
• Metadata-based content management
– Metadata represents a re-usable result of content analytics
– It can be repurposed allowing for a wide range of applications
– Most search engines do analytics, but the results are not
explicit, so they cannot be validated, refined or used by other
applications
• Linking text and structured data
– Allows structured, uniform and efficient access to diverse domain
models, taxonomies, dictionaries, reference databases
• Reference data management
– E.g. product catalogs and taxonomies that are too structured to be
managed with NoSQL, but too diverse and interconnected for SQL
• Using linked open data (LOD)
– A growing amount of diverse public data can be used in enterprise
Knowledge Management applications
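
A minimal sketch of the first two points, not taken from the deck: annotation results are kept as explicit triples, so another application can reuse them and join them with reference data. All identifiers (`doc:1`, `ex:mentions`, `ex:broader`, `ref:COPD`, and so on) and the tiny taxonomy are hypothetical, and plain Python tuples stand in for a real RDF store.

```python
# Explicit annotation metadata stored as (subject, predicate, object) triples.
# All identifiers below are hypothetical, chosen only for illustration.
annotations = {
    ("doc:1", "ex:mentions", "ref:COPD"),
    ("doc:2", "ex:mentions", "ref:Asthma"),
    ("doc:3", "ex:mentions", "ref:COPD"),
}

# Reference data (a toy taxonomy) kept in the same triple form.
reference = {
    ("ref:COPD", "ex:broader", "ref:RespirationDisorders"),
    ("ref:Asthma", "ex:broader", "ref:BronchialDiseases"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def documents_about(broader_concept):
    """Join annotations with the taxonomy: documents mentioning any concept
    whose direct broader concept is `broader_concept`."""
    graph = annotations | reference
    narrower = {s for s, _, _ in match(graph, p="ex:broader", o=broader_concept)}
    return sorted(s for s, _, o in match(graph, p="ex:mentions") if o in narrower)


print(documents_about("ref:RespirationDisorders"))
```

Because the metadata is explicit, the same triples could equally feed a search index, a topic page, or a quality check without re-running the text analysis.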

LOD: Growing Exponentially
Linked Data Datasets
Year:     2007  2008  2009  2010  2011  2012  2013
Datasets:   27    43    89   162   295   822  2,289
• July 2013 stats: 2 289 datasets (http://stats.lod2.eu/)
• Growing exponentially (see the dotted trend line)
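
The "exponential" claim can be checked against the counts on this slide. The sketch below fits a straight line to the logarithm of the dataset counts; the least-squares fit is an illustrative choice, not necessarily how the slide's trend line was produced.

```python
import math

# Dataset counts per year, as listed on the slide.
counts = {2007: 27, 2008: 43, 2009: 89, 2010: 162,
          2011: 295, 2012: 822, 2013: 2289}


def annual_growth_factor(series):
    """Least-squares fit of ln(count) = a + b * year; returns exp(b)."""
    years = sorted(series)
    logs = [math.log(series[y]) for y in years]
    mean_x = sum(years) / len(years)
    mean_y = sum(logs) / len(logs)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
             / sum((x - mean_x) ** 2 for x in years))
    return math.exp(slope)


factor = annual_growth_factor(counts)
print(f"fitted annual growth factor: {factor:.2f}")
```

On these seven data points the fitted factor comes out at roughly 2, meaning the number of datasets about doubled each year over the period.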

How Does Inference Help?
• Intelligent mapping of queries to data
– This matters a lot when an application has to query a dataset
combined from 10+ sources, which evolve independently
– There is no way an application developer can stay on top of all schemata
and all datasets, all the time
• Finding patterns and inferring new relationships
– Think of someone constantly looking for patterns that elicit new
relations, which can match patterns that elicit other relations …
– Or someone who goes deeper and deeper into finding new ways to
rewrite a query, over and over again, until all alternatives are
exhausted
• Get deeper and more complete results
• Cheaper data integration, easier querying
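
A toy illustration of both points, assuming two hypothetical sources (`srcA:`, `srcB:`) that describe locations with different properties. A naive forward-chaining loop applies three rules (sub-property transitivity, sub-property inheritance, and `owl:TransitiveProperty`) until nothing new is derived. It is a sketch of the idea only; a production engine such as GraphDB works very differently internally.

```python
def infer(triples):
    """Naive forward chaining to a fixpoint over three rules."""
    closed = set(triples)
    while True:
        sub_props = {(s, o) for s, p, o in closed if p == "rdfs:subPropertyOf"}
        transitive = {s for s, p, o in closed
                      if p == "rdf:type" and o == "owl:TransitiveProperty"}
        new = set()
        # Rule 1: subPropertyOf is itself transitive.
        for p, q in sub_props:
            for q2, r in sub_props:
                if q == q2:
                    new.add((p, "rdfs:subPropertyOf", r))
        # Rule 2: a statement made with a sub-property also holds
        # for its super-property.
        for s, p, o in closed:
            for sub, sup in sub_props:
                if p == sub:
                    new.add((s, sup, o))
        # Rule 3: chain statements of properties declared transitive.
        for a, p, b in closed:
            if p in transitive:
                for b2, p2, c in closed:
                    if p2 == p and b2 == b:
                        new.add((a, p, c))
        if new <= closed:
            return closed
        closed |= new


# Hypothetical data: each source uses its own property for "location".
data = {
    ("srcA:basedIn", "rdfs:subPropertyOf", "ex:locatedIn"),
    ("srcB:hqCity", "rdfs:subPropertyOf", "ex:locatedIn"),
    ("ex:locatedIn", "rdf:type", "owl:TransitiveProperty"),
    ("org:Acme", "srcA:basedIn", "geo:Sofia"),
    ("org:Bolt", "srcB:hqCity", "geo:London"),
    ("geo:Sofia", "ex:locatedIn", "geo:Bulgaria"),
}

closure = infer(data)
in_bulgaria = sorted(s for s, p, o in closure
                     if p == "ex:locatedIn" and o == "geo:Bulgaria")
print(in_bulgaria)
```

The application asks only about `ex:locatedIn`; it never needs to know that one source says `srcA:basedIn` and another `srcB:hqCity`, which is the "intelligent mapping of queries to data" described above.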

Ontotext Technology Portfolio

BBC: The Perfect Application
Since 2000, semantic technology had been striving for:
• Pertinent applications, a really good use case
• Real high-profile projects to prove its maturity
The “Dynamic Semantic Publishing” architecture
implemented by the BBC for its FIFA World Cup 2010
website filled this gap!
It demonstrates:
• How an RDF database serves well where RDBMSs fail
• How text mining and triplestores complement one another
• How inference adds value at a decent scale
• 24/7 live operation that cannot work without a functional triplestore

Ontotext and BBC
Profile
• Mass media broadcaster founded in 1922
• 23,000 employees and over 5 billion
pounds in annual revenue.
Goals
• Create a dynamic semantic publishing
platform that assembled web pages on-the-fly
using a variety of data sources
• Deliver highly relevant data to web site
visitors with sub-second response
Challenges
• BBC journalists author and publish content
which is then statically rendered. The
costs and time to do this were high.
• Diverse content was difficult to navigate,
content re-use was not flexible
• User experience needed to be improved
with relevant content
"The goal is to be able to more easily and
accurately aggregate content, find it and
share it across many sources. From these
simple relationships and building blocks you
can dynamically build up incredibly rich sites
and navigation on any platform."
John O’Donovan
Chief Technical Architect

BBC: The Perfect Application (ctd)
• The BBC’s FIFA World Cup 2010 project was widely
recognized as the best showcase for SemTech
– It used OWLIM as a triplestore (chosen after a thorough evaluation)
– It triggered a wave of adoption of the technology
• The next milestone: London 2012 Olympic Games
– The two most important websites used the DSP architecture: the
official one, operated by Press Association, and the one of the BBC
– Ontotext text-mining technology was used for content enrichment
• Four years later this application pattern is still the
best use case
– And there are still no other triplestores that can survive such load,
judging by the LDBC Semantic Publishing Benchmark, public
information and feedback from the industry

Ontotext and AstraZeneca
Profile
• Global biopharma company
• $28 billion in sales in 2012
• $4 billion in R&D across three continents
Goals
• Efficient design of new clinical studies
• Quick access to all of the data
• Improved evidence-based decision-making
• Strengthen the knowledge feedback loop
• Enable predictive science
Challenges
• Over 7,000 studies and 23,000 documents
are difficult to obtain
• Searches returning 1,000 – 10,000 results
• Document repositories not designed for
reuse
• Tedious process to arrive at evidence-based
decisions

Context-based Disambiguation

Ontotext and LMI
Profile
• Established in 1961 to enable federal
agencies
• Specializes in logistics, financial,
infrastructure & information management
Goals
• Unlock large collections of complex
documents
• Improve analyst productivity
• Create an application they can sell to US
Federal agencies
Challenges
• Analysts taking hours to find, download
and search documents, using inaccurate
keyword searches
• Needed a knowledge base to search
quickly and guide the analysts – highly
relevant searches
• Extracts knowledge from collection of
documents
• Uses GraphDB to intuitively search and filter
• Knowledge base used to suggest searches
• Hyper speed performance
• Huge savings in analyst time
• Accurate results

Some of our clients
[Figure: client logos; one is captioned "The most popular financial newspaper"]

Publishing and Media Solution

Solution Features
• Dedicated solutions for media and publishing
• Based on the Ontotext Semantic Platform
• Mature implementation and continuous adaptation
methodology
• Introducing advanced features to the authoring,
editorial and publishing phases of content and data
workflows

Methodology

Architecture Overview

Authoring
Related assets – as you type
Related entities and concepts
Entity profiles and facts on the fly
Create higher value content at the same cost

Contextual Authoring

Curation
Continuous adaptation through editorial feedback
Query driven publishing templates
Dynamic re-purposing and reuse
New publishing products with the same content

Example of Client Integrated Curation

Example Curation Tool: Press Association

Monitoring and Curation Tool

Continuous Adaptation

Publishing
Dynamic construction of products (e.g. topic
pages)
Personalized content streams
Semantics driven trend and user analytics
Behavior driven personal asset streams

User Behavior Tracking
[Diagram: users perform actions on articles (comments, votes, posts, preview, read) and searches (full-text query, tag, category, tag set); a search has a date and results, and "leads to" previews and reads; supporting elements: category taxonomy and search log]

Personalized Recommendations
[Diagram: user profile; behavioural and contextual similarity; reads]

Methodology

Methodology

Complete Domain Ontology

Example KB for 50 daily publications

Methodology

Design of Machine Learning Pipeline

Discovering Suspicious Relationships

• Have a database of locations, with part-of info
• Have a database with companies, with dependencies
• Define semantics for the relevant relationships:
– Sub-region and control are transitive relationships
– Located-in is transitive over sub-region (an entity located in a region
is also located in every region that contains it)
• Define the semantics of suspicious relationships
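# The prefixes ptop:, fibo: and my: are assumed to be declared elsewhere.
CONSTRUCT { ?orgA my:suspiciousLink ?orgB }
WHERE {
  # orgA is located in ?x and controls an intermediary ?y ...
  ?orgA ptop:locatedIn ?x ; fibo:controls ?y .
  # ... which controls orgB and is located in ?z
  ?y fibo:controls ?orgB ; ptop:locatedIn ?z .
  # orgA and orgB share the location ?x
  ?orgB ptop:locatedIn ?x .
  # the intermediary's location is an offshore zone
  ?z a ptop:OffshoreZone .
}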
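
The same pattern can be followed step by step in plain Python. This is a minimal sketch under stated assumptions, not how a triplestore evaluates the query: the relations are plain sets of pairs, the three semantic rules are applied by hand before matching, and all organisation and place names are hypothetical.

```python
def transitive_closure(pairs):
    """Closure of a binary relation given as a set of (a, b) pairs."""
    closure = set(pairs)
    while True:
        extra = {(a, d) for a, b in closure for c, d in closure if b == c}
        if extra <= closure:
            return closure
        closure |= extra


def suspicious_links(located_in, sub_region, controls, offshore_zones):
    """Pairs (orgA, orgB) that share a location while orgA controls orgB
    through an intermediary located in an offshore zone."""
    regions = transitive_closure(sub_region)   # sub-region is transitive
    control = transitive_closure(controls)     # control is transitive
    # located-in is transitive over sub-region
    located = set(located_in) | {(e, big) for e, r in located_in
                                 for small, big in regions if r == small}
    links = set()
    for org_a, y in control:
        for y2, org_b in control:
            if y != y2:
                continue
            offshore = any(e == y and z in offshore_zones for e, z in located)
            shared = ({x for e, x in located if e == org_a}
                      & {x for e, x in located if e == org_b})
            if offshore and shared:
                links.add((org_a, org_b))
    return links


# Hypothetical data: A and B are in different cities of the same country,
# and A controls B through a shell company on an offshore island.
located_in = {("org:A", "geo:Sofia"), ("org:B", "geo:Plovdiv"),
              ("org:Shell", "geo:IslandTown")}
sub_region = {("geo:Sofia", "geo:Bulgaria"), ("geo:Plovdiv", "geo:Bulgaria"),
              ("geo:IslandTown", "geo:Island")}
controls = {("org:A", "org:Shell"), ("org:Shell", "org:B")}

print(suspicious_links(located_in, sub_region, controls, {"geo:Island"}))
```

Note that the link is found only because of the inferred facts: neither organisation is stated to be in `geo:Bulgaria`, and the shell company is stated to be in a town, not in the offshore zone itself.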

What Does It Take to Make It Work?

Use Cases
• Investigating networks of linked entities
– As prerequisite for risk assessment and compliance research
• Risk assessment
– Tracing information about suspicious entities
– Identifying risk-indicators across multiple sources
– Identifying risks related to linked entities
– Determining exposure against a group of linked entities
• Compliance-related research
– Fraud detection, insider trading, etc.
• Searching in large policies and regulations
– See Open Policy

How: Semantic BI/Data-warehouses
• Imagine an integrated database, which allows querying
across siloed databases
– E.g. bond market data vs. risk assessment vs. equity markets vs. M&A
– A lot of duplicate data across various databases in different
departments of banks, and data is simply not linked or organized in a
unified data model
• Benefits compared to the mainstream technology:
– Lower cost of development and maintenance;
– Direct benefit from industry standards, using inference
– Real-time updates, unlike traditional data-warehousing, where
updates should often be scheduled overnight
– Support for a wide variety of analytical queries, which are far more
flexible than traditional approaches
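
A minimal sketch of the idea, assuming two hypothetical departmental silos with different schemas. The field names, the `ex:` properties, the `LEI-…` identifiers and the figures are all invented for illustration. Each silo is mapped once into a shared triple vocabulary, after which a single query spans both.

```python
# Two departmental silos with different schemas (hypothetical data).
bond_desk = [
    {"issuer_id": "LEI-001", "issuer": "Acme Corp", "bond_exposure": 5.0},
]
equity_desk = [
    {"company_lei": "LEI-001", "equity_position": 2.5},
    {"company_lei": "LEI-002", "equity_position": 1.0},
]


def to_triples(rows, id_field, mapping):
    """Map the rows of one silo to triples in a shared vocabulary."""
    return [(row[id_field], predicate, row[field])
            for row in rows
            for field, predicate in mapping.items()
            if field in row]


# One unified graph; a list is used so equal values are not merged.
graph = (
    to_triples(bond_desk, "issuer_id",
               {"issuer": "ex:name", "bond_exposure": "ex:bondExposure"})
    + to_triples(equity_desk, "company_lei",
                 {"equity_position": "ex:equityExposure"})
)

EXPOSURE_PROPERTIES = {"ex:bondExposure", "ex:equityExposure"}


def total_exposure(triples):
    """Total exposure per entity across all silos."""
    totals = {}
    for subject, predicate, value in triples:
        if predicate in EXPOSURE_PROPERTIES:
            totals[subject] = totals.get(subject, 0.0) + value
    return totals


print(total_exposure(graph))
```

The sketch relies on both silos already sharing an entity identifier; where they do not, an identity-resolution step would be needed first.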

Ontotext and top 3 Business Media
Profile
• Top 3 business media
• Focused both on B2C publishing and B2B
services
Goals
• Create a horizontal platform for both data
and content based on semantics and serve
all functionality through it
Challenges
• Critical part of the entire workflow
• Multiple development projects in parallel
with up to 2 months' time between
inception and go-live
• GraphDB used not only for data, but for
content storage as well
• Horizontal platform with focus on
organizations, people, GPEs and relations
between them
• Automatic extraction of all these concepts
and relationships
• Separate stream of work for a user behavior
based recommendation of relevant content
and data across the entire media

Reference Projects: BCA/Euromoney
• BCA/Euromoney Macroeconomics Reports
– Implementation of the Euromoney Semantic Platform
• Automatically generate metadata about:
– Markets, geo-political entities, economies, currencies, indicators, indices;
– Themes of the report;
– Economic and market conditions;
– Views of the economist with horizon, focus of the view, and prediction;
– Suggested trades of tradable objects (bonds, commodities, equities).
• Semantic indices powering various services:
– Live Charts – serving macroeconomic charts with the possibility to add
additional data series/indices;
– Macroeconomists' dashboard of views with their objects, sentiment,
horizon, and agreement/disagreement.

Ontotext and Euromoney
Profile
• Euromoney Institutional Investor PLC, the
international online information and events
group
Goals
• Create a horizontal platform to serve 100
different publications
• Create a new publishing and information
platform which would include the latest
authoring, storing, and display technologies,
including semantic annotation, search and a
triplestore repository
Challenges
• Different domains covered
• Sophisticated content analytics incl.
relation, template and scenario extraction
• Analytics of reports and news of various
domains
• Extraction of sophisticated macroeconomic
views on markets and market conditions;
trades, condition and trade horizons, assets,
asset allocations, etc.
• Multi-faceted search
• Completely new content and data
infrastructure

Wrap up
• Ontotext has a full stack of semantic technologies
• Triplestores combine the best of NoSQL and SQL
• Inference fosters discovery in diverse dynamic data
• GraphDB is in a league of its own:
– Standards-compliant – comprehensive support for OWL and SPARQL
– Efficient inference through the entire life-cycle of the data
– High-availability cluster architecture – proven and mature
– FTS and NoSQL Connectors for seamless integration
• End-to-end solution for Media and Publishing
– Authoring, curation and publishing through adaptive text-mining
• All the above proven with industry leaders
