Triplestores and inference, applications in Finance, text-mining. Projects and solutions for financial media and publishers.
Keystone Industrial Panel, ISWC 2014, Riva del Garda, 18 Oct 2014.
Thanks to Atanas Kiryakov for this presentation, I just cut it to size.
Semantic Technology in Publishing & Finance
1. Semantic Technology
in Publishing & Finance
Triplestores and inference, applications in Finance, the GraphDB engine,
text-mining, projects and solutions for financial media and publishers
Keystone Industrial Panel
ISWC 2014, Riva del Garda, 18 Oct 2014
Semantic Technology in Publishing & Finance Oct 2014 #1
2. Outline
• Introduction to Ontotext
• Clients, cases
• Text Mining, Media and Publishing Solution
• SemTech applications in Finance
• Wrap-up
3. Ontotext
• Information management company providing text analysis,
data management and state-of-the-art semantic technology
• 75 employees, headquartered in Sofia, Bulgaria
• Sales presence in London, Washington DC, and Boston
• Clients include BBC, AstraZeneca, US DoD, OUP, Wiley, Getty…
• Over 400 person-years in R&D to create a one-stop shop for:
– Content enrichment
– Data management
– Graph database engine
• Open and standards-compliant technology:
– RDF(S), OWL, GATE, Sesame
5. Semantic Annotation
[Figure: semantic annotation of the article pmid:17714090 (Clinical and experimental pharmacology …; author Ian A Yang). The article is linked to the concepts Bronchial Diseases, Respiration Disorders, COPD, Chronic Obstructive Airway Diseases and Asthma; identifiers shown on the slide: umls:C0035204, umls:C0006261, umls:C000496.]
6. Semantic Annotation
Semantic Annotation goes far beyond
tagging. It allows search using enrichment,
linking and rules to return explicit and
implicit results – complete intelligence.
Content Enrichment
• Text Mining & Classification
• Curation
• Quality Monitoring
Data Management
• Ontologies and Semantic Annotation
• Web mining
• Identity Resolution
Graph Database
• Standards Based
• 24-7 Resiliency
• Hybrid Semantic Queries & Search
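To make the difference between explicit and implicit results concrete, here is a minimal Python sketch. It assumes a toy gazetteer and a toy broader-concept hierarchy; the `ex:` identifiers, the hierarchy and the sample documents are hypothetical and are not taken from UMLS or from Ontotext's pipeline.

```python
# Toy gazetteer: surface form -> concept id (hypothetical identifiers).
GAZETTEER = {
    "asthma": "ex:Asthma",
    "copd": "ex:COPD",
}

# Toy hierarchy: concept -> broader concept (an assumption for illustration).
BROADER = {
    "ex:Asthma": "ex:BronchialDiseases",
    "ex:COPD": "ex:BronchialDiseases",
    "ex:BronchialDiseases": "ex:RespirationDisorders",
}


def annotate(text):
    """Return the set of concept ids whose surface form occurs in the text."""
    words = {w.strip(".,;:()").lower() for w in text.split()}
    return {cid for form, cid in GAZETTEER.items() if form in words}


def ancestors(concept):
    """Yield the concept itself and every broader concept above it."""
    while concept is not None:
        yield concept
        concept = BROADER.get(concept)


def search(docs, concept):
    """Find documents annotated with the concept or any narrower concept."""
    hits = []
    for doc_id, text in docs.items():
        if any(concept in ancestors(c) for c in annotate(text)):
            hits.append(doc_id)
    return sorted(hits)


DOCS = {
    "doc1": "New treatment options for asthma in adults.",
    "doc2": "Quarterly results of a retail bank.",
    "doc3": "Smoking and COPD: a review.",
}
```

Searching for `ex:Asthma` returns only the document that mentions it (an explicit result), while searching for `ex:RespirationDisorders` also returns documents tagged with narrower concepts, even though the broader term never appears in their text (implicit results).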
7. What is RDF Good for?
• Metadata-based content management
– Metadata represents a re-usable result of content analytics
– It can be repurposed allowing for a wide range of applications
– Most of the search engines do analytics, but the results are not
explicit; so, they cannot be validated, refined and used by other
applications
• Linking text and structured data
– Allows structured, uniform and efficient access to diverse domain
models, taxonomies, dictionaries, reference databases
• Reference data management
– E.g. product catalogs and taxonomies that are too structured to be
managed with NoSQL, but too diverse and interconnected for SQL
• Using linked open data (LOD)
– A growing and diverse body of public data can be used in enterprise
Knowledge Management applications
8. LOD: Growing Exponentially
Linked Data datasets per year:
Year      2007  2008  2009  2010  2011  2012  2013
Datasets    27    43    89   162   295   822  2,289
• July 2013 stats: 2,289 datasets (http://stats.lod2.eu/)
• Growing exponentially (see the dotted trend line)
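The exponential claim can be checked against the counts on this slide. The following Python sketch fits a straight line to the logarithm of the dataset counts with ordinary least squares. The fitting method is an assumption chosen for illustration; the slide does not say how its trend line was computed.

```python
import math

# Dataset counts per year, as read from the chart on this slide.
COUNTS = {2007: 27, 2008: 43, 2009: 89, 2010: 162, 2011: 295, 2012: 822, 2013: 2289}


def yearly_growth_factor(counts):
    """Least-squares fit of log(count) = a + b * year; returns exp(b)."""
    years = sorted(counts)
    logs = [math.log(counts[y]) for y in years]
    mean_x = sum(years) / len(years)
    mean_y = sum(logs) / len(logs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    sxx = sum((x - mean_x) ** 2 for x in years)
    return math.exp(sxy / sxx)


factor = yearly_growth_factor(COUNTS)
print(f"Fitted growth factor per year: {factor:.2f}")
```

Under this fit the number of datasets roughly doubles every year over 2007 to 2013.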
9. How Does Inference Help?
• Intelligent mapping of queries to data
– This matters a lot when an application should query a dataset
combined from 10+ sources, which evolve independently
– There is no way an application developer can stay on top of all schemata
and all datasets, all the time
• Finding patterns and inferring new relationships
– Think of someone constantly looking for patterns that elicit new
relations, which can match patterns that elicit other relations …
– Or someone who goes deeper and deeper into finding new ways to
rewrite a query, over and over again, until all alternatives are
exhausted
• Get deeper and more complete results
• Cheaper data integration, easier querying
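As a rough illustration of both points, the Python sketch below applies two RDFS/OWL-style rules (sub-property and transitivity) repeatedly until no new facts appear. It is a naive forward-chaining loop, not how GraphDB implements inference, and all property and entity names (`srcA:hasHQIn`, `ex:locatedIn`, and so on) are hypothetical.

```python
def infer(triples, sub_property_of, transitive):
    """Apply two RDFS/OWL-style rules until no new triple appears (a fixpoint)."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            # Rule 1: p subPropertyOf q, (s p o)  =>  (s q o)
            if p in sub_property_of:
                new.add((s, sub_property_of[p], o))
            # Rule 2: p transitive, (s p o), (o p x)  =>  (s p x)
            if p in transitive:
                for s2, p2, o2 in facts:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))
        if new <= facts:
            return facts
        facts |= new


# Two sources describe company locations with different vocabularies.
TRIPLES = {
    ("acme", "srcA:hasHQIn", "sofia"),
    ("globex", "srcB:basedIn", "london"),
    ("sofia", "ex:partOf", "bulgaria"),
    ("bulgaria", "ex:partOf", "europe"),
}
SUB_PROPERTY_OF = {"srcA:hasHQIn": "ex:locatedIn", "srcB:basedIn": "ex:locatedIn"}
TRANSITIVE = {"ex:partOf"}

facts = infer(TRIPLES, SUB_PROPERTY_OF, TRANSITIVE)
located = sorted((s, o) for s, p, o in facts if p == "ex:locatedIn")
```

After inference, a single query on `ex:locatedIn` covers both sources, so the application does not need to know either source schema.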
11. Outline
• Introduction to Ontotext
• Clients, cases
• Text Mining, Media and Publishing Solution
• SemTech applications in Finance
• Wrap-up
12. BBC: The Perfect Application
Since 2000, semantic technology has been striving for:
• Pertinent applications, a really good use case
• Real high-profile projects to prove its maturity
The “Dynamic Semantic Publishing” architecture
implemented by the BBC for its FIFA World Cup 2010
website filled this gap!
It demonstrates:
• How an RDF database serves well where RDBMSs fail
• How text-mining and triplestores complement one another
• How inference adds value at a decent scale
• 24/7 live operation that cannot work without a functional triplestore
13. Ontotext and BBC
Profile
• Mass media broadcaster founded in 1922
• 23,000 employees and over 5 billion
pounds in annual revenue.
Goals
• Create a dynamic semantic publishing
platform that assembled web pages on-the-fly
using a variety of data sources
• Deliver highly relevant data to web site
visitors with sub-second response
Challenges
• BBC journalists author and publish content
which is then statically rendered. The
costs and time to do this were high.
• Diverse content was difficult to navigate,
content re-use was not flexible
• User experience needed to be improved
with relevant content
"The goal is to be able to more easily and
accurately aggregate content, find it and
share it across many sources. From these
simple relationships and building blocks you
can dynamically build up incredibly rich sites
and navigation on any platform."
John O’Donovan
Chief Technical Architect
14. BBC: The Perfect Application (ctd)
• The BBC’s FIFA World Cup 2010 project was widely
recognized as the best showcase for SemTech
– It used OWLIM as a triplestore (chosen after a thorough evaluation)
– It triggered a wave of adoption of the technology
• The next milestone: London 2012 Olympic Games
– The two most important websites used the DSP architecture: the
official one, operated by Press Association, and the one of the BBC
– Ontotext text-mining technology was used for content enrichment
• Four years later this application pattern is still the
best use case
– And there are still no other triplestores that can survive such a load,
judging by the LDBC Semantic Publishing Benchmark, public
information and feedback from the industry
15. Ontotext and AstraZeneca
Profile
• Global bio-pharma company
• $28 billion in sales in 2012
• $4 billion in R&D across three continents
Goals
• Efficient design of new clinical studies
• Quick access to all of the data
• Improved evidence based decision-making
• Strengthen the knowledge feedback loop
• Enable predictive science
Challenges
• Over 7,000 studies and 23,000 documents
are difficult to obtain
• Searches returning 1,000 – 10,000 results
• Document repositories not designed for
reuse
• Tedious process to arrive at evidence
based decisions
17. Ontotext and LMI
Profile
• Established in 1961 to enable federal
agencies
• Specializes in logistics, financial,
infrastructure & information management
Goals
• Unlock large collections of complex
documents
• Improve analyst productivity
• Create an application they can sell to US
Federal agencies
Challenges
• Analysts taking hours to find, download
and search documents, using inaccurate
keyword searches
• Needed a knowledge base to search
quickly and guide the analysts – highly
relevant searches
• Extracts knowledge from collection of
documents
• Uses GraphDB to intuitively search and filter
• Knowledge base used to suggest searches
• Hyper speed performance
• Huge savings in analyst time
• Accurate results
18. Some of our clients
The most popular financial newspaper
19. Outline
• Introduction to Ontotext
• Clients, cases
• Text Mining, Media and Publishing Solution
• SemTech applications in Finance
• Wrap-up
20. Publishing and Media Solution
21. Solution Features
• Dedicated solutions for media and publishing
• Based on the Ontotext Semantic Platform
• Mature implementation and continuous adaptation
methodology
• Introducing advanced features to the authoring,
editorial and publishing phases of content and data
workflows
24. Authoring
Related assets – as you type
Related entities and concepts
Entity profiles and facts on the fly
Create higher value content at the same cost
26. Curation
Continuous adaptation through editorial feedback
Query driven publishing templates
Dynamic re-purposing and reuse
New publishing products with the same content
27. Example of Client Integrated Curation
28. Example Curation Tool: Press Association
29. Monitoring and Curation Tool
31. Publishing
Dynamic construction of products (e.g. topic
pages)
Personalized content streams
Semantics driven trend and user analytics
Behavior driven personal asset streams
32. User Behavior Tracking
[Figure: user behavior model relating Action (comments, votes, posts, preview, read), Article, Search (FTS query, tag, category, date), Result, tag set, category taxonomy and Search Log, with "perform", "contains" and "leads to" links between them.]
33. Personalized Recommendations
[Figure: personalized recommendations based on the User Profile, Behavioural and Contextual Similarity, and Reads.]
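One simple way to turn reads into recommendations is sketched below in Python. It is only an illustration under stated assumptions: the profile is a plain count of tags from the articles a user has read, similarity is cosine similarity, and the article ids and tags are hypothetical. The slide does not describe the actual model used.

```python
import math
from collections import Counter


def build_profile(read_ids, article_tags):
    """Sum the tags of every article the user has read into a weighted profile."""
    profile = Counter()
    for article_id in read_ids:
        profile.update(article_tags[article_id])
    return profile


def cosine(a, b):
    """Cosine similarity between two sparse tag-weight vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def recommend(read_ids, article_tags, top_n=2):
    """Rank unread articles by similarity between their tags and the profile."""
    profile = build_profile(read_ids, article_tags)
    scored = [
        (cosine(profile, Counter(tags)), article_id)
        for article_id, tags in article_tags.items()
        if article_id not in read_ids
    ]
    # Highest score first; ties broken by article id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [article_id for score, article_id in scored[:top_n] if score > 0]


# Hypothetical articles with semantic tags produced by annotation.
ARTICLE_TAGS = {
    "a1": ["bonds", "ecb"],
    "a2": ["ecb", "inflation"],
    "a3": ["football"],
    "a4": ["bonds", "inflation", "ecb"],
}
```

For a user who has read only `a1`, `recommend({"a1"}, ARTICLE_TAGS)` ranks `a4` ahead of `a2` and leaves out `a3`, which shares no tags with the profile.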
41. Design of Machine Learning Pipeline
42. Outline
• Introduction to Ontotext
• Clients, cases
• Text Mining, Media and Publishing Solution
• SemTech applications in Finance
• Wrap-up
44. What It Takes to Make It Work?
• Have a database of locations, with part-of info
• Have a database of companies, with dependencies
• Define semantics for the relevant relationships:
– sub-region and control are transitive relationships
– located-in is transitive over sub-region
• Define the semantics of suspicious relationships:
CONSTRUCT { ?orgA my:suspiciousLink ?orgB } WHERE {
  # ?orgA controls an intermediary ?y, which in turn controls ?orgB
  ?orgA ptop:locatedIn ?x ; fibo:controls ?y .
  ?y fibo:controls ?orgB ; ptop:locatedIn ?z .
  # ?orgA and ?orgB share a location ?x ...
  ?orgB ptop:locatedIn ?x .
  # ... while the intermediary is located in an offshore zone
  ?z a ptop:OffshoreZone .
}
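The same logic can be sketched outside a triplestore. The Python example below first materializes the transitive relations, then matches the pattern from the CONSTRUCT query. It is a minimal sketch, assuming small in-memory sets of pairs; the company and location names are hypothetical, and a real triplestore would perform the inference itself.

```python
def closure(pairs):
    """Transitive closure of a set of (a, b) pairs."""
    result = set(pairs)
    while True:
        extra = {(a, d) for a, b in result for c, d in result if b == c}
        if extra <= result:
            return result
        result |= extra


def suspicious_links(controls, located_in, sub_region, offshore):
    """Mirror the CONSTRUCT pattern over materialized (inferred) relations."""
    ctrl = closure(controls)      # control is transitive
    region = closure(sub_region)  # sub-region is transitive
    # located-in is transitive over sub-region: in X, and X within Y => in Y
    loc = set(located_in) | {
        (entity, big)
        for entity, small in located_in
        for s, big in region
        if s == small
    }
    links = set()
    for org_a, y in ctrl:
        for y2, org_b in ctrl:
            if y2 != y:
                continue
            y_offshore = any(e == y and z in offshore for e, z in loc)
            shared = {x for e, x in loc if e == org_a} & {
                x for e, x in loc if e == org_b
            }
            if y_offshore and shared:
                links.add((org_a, org_b))
    return links


# Hypothetical data: AcmeUK controls AcmeTrading through an offshore shell.
CONTROLS = {("AcmeUK", "ShellCo"), ("ShellCo", "AcmeTrading")}
LOCATED_IN = {
    ("AcmeUK", "London"),
    ("AcmeTrading", "Manchester"),
    ("ShellCo", "IslandX"),
}
SUB_REGION = {("London", "UK"), ("Manchester", "UK")}
OFFSHORE = {"IslandX"}

links = suspicious_links(CONTROLS, LOCATED_IN, SUB_REGION, OFFSHORE)
```

Here the two companies are registered in different cities, so the shared location (`UK`) is found only because located-in is propagated over sub-region. Without that inference step the pattern would not match.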
45. Use Cases
• Investigating networks of linked entities
– As prerequisite for risk assessment and compliance research
• Risk assessment
– Tracing information about suspicious entities
– Identifying risk-indicators across multiple sources
– Identifying risks related to linked entities
– Determining exposure against a group of linked entities
• Compliance-related research
– Fraud detection, insider trading, etc.
• Searching in large policies and regulations
– See Open Policy
46. How: Semantic BI/Data-warehouses
• Imagine an integrated database that allows querying
across siloed databases
– E.g. bond market data vs. risk assessment vs. equity markets vs. M&A
– A lot of duplicate data across various databases in different
departments of banks, and data is simply not linked or organized in a
unified data model
• Benefits compared to the mainstream technology:
– Lower cost of development and maintenance;
– Direct benefit from industry standards, using inference
– Real-time updates, unlike traditional data-warehousing, where
updates should often be scheduled overnight
– Support for a wide variety of analytical queries, which are far more
flexible than traditional approaches
47. Ontotext and top 3 Business Media
Profile
• Top 3 business media
• Focused both on B2C publishing and B2B
services
Goals
• Create a horizontal platform for both data
and content based on semantics and serve
all functionality through it
Challenges
• Critical part of the entire workflow
• Multiple development projects in parallel
with up to two months between
inception and go-live
• GraphDB used not only for data, but for
content storage as well
• Horizontal platform with focus on
organizations, people, GPEs and relations
between them
• Automatic extraction of all these concepts
and relationships
• Separate stream of work for a user behavior
based recommendation of relevant content
and data across the entire media
48. Reference Projects: BCA/Euromoney
• BCA/Euromoney Macroeconomics Reports
– Implementation of the Euromoney Semantic Platform
• Automatically generate metadata about:
– Markets, geo-political entities, economies, currencies, indicators, indices;
– Themes of the report;
– Economic and market conditions;
– Views of the economist with horizon, focus of the view, and prediction;
– Suggested trades of tradable objects (bonds, commodities, equities).
• Semantic indices powering various services:
– Live Charts – serving macroeconomic charts with the possibility to add
additional data series/indices;
– Macroeconomists’ dashboard of views with their objects, sentiment,
horizon, and agreement/disagreement.
49. Ontotext and Euromoney
Profile
• Euromoney Institutional Investor PLC, the
international online information and events
group
Goals
• Create a horizontal platform to serve 100
different publications
• Create a new publishing and information
platform which would include the latest
authoring, storing, and display technologies,
including semantic annotation, search and a
triplestore repository
Challenges
• Different domains covered
• Sophisticated content analytics incl.
relation, template and scenario extraction
• Analytics of reports and news of various
domains
• Extraction of sophisticated macroeconomic
views on markets and market conditions;
trades, condition and trade horizons, assets,
asset allocations, etc.
• Multi-faceted search
• Completely new content and data
infrastructure
50. Wrap up
• Ontotext has a full stack of semantic technologies
• Triplestores combine the best of NoSQL and SQL
• Inference fosters discovery in diverse dynamic data
• GraphDB is in a league of its own:
– Standards-compliant – comprehensive support for OWL and SPARQL
– Efficient inference through the entire life-cycle of the data
– High-availability cluster architecture – proven and mature
– FTS and NoSQL Connectors for seamless integration
• End-to-end solution for Media and Publishing
– Authoring, curation and publishing through adaptive text-mining
• All the above proven with industry leaders
Editor's notes
AstraZeneca is a world leader in bio-pharmaceuticals with over $28 billion in revenue and $4 billion invested in R&D. Innovation and creating great medicines are core values.
To continue on their path to success, they needed to design new clinical studies efficiently through instant access to all of their data. By doing this, they could improve evidence based decision making and create a knowledge feedback loop through accessing historical clinical studies and related documents.
But with over 7,000 studies and 23,000 documents, the volume of unstructured data was overwhelming them. They would search for what they thought were relevant results and instead get anywhere between 1,000 and 10,000 results that needed review. The fact is their document repository was not designed for reuse. They had not extracted the meaning they needed from their documents. They had not created reusable metadata that would allow search results to be highly targeted. The tedious process of document review was slowing down the innovation engine.
They needed to reduce the onerous manual effort, gain complete visibility, decrease the time to locate knowledge and arrive at instant analytical results.
[click to build the slide]
Ontotext was able to extract meaning from the unstructured documents and optimize their knowledge repository for flexible semantic searches that leveraged newly created metadata. The data was stored in a way that optimized it for navigation and retrieval. To do this, we used vocabularies for drugs and biomarkers while also creating a master database linking all the information together in OWL. We indexed all of the disambiguated data, allowing users to find targeted results from context-based search.
In the end this created less patient & regulatory risk, used fewer resources to locate and analyze clinical data and enhanced the innovation process allowing AZ to create new drugs faster. All of their goals were met and they are using this system in production today.
To add steps 1, 2 and 3 and so on. Make it more interactive.
Add explanation to the slide
LMI was founded in 1961 to provide solutions to enable federal government agencies.
Specializes in logistics, acquisition and financial management, infrastructure management, information management and policy and program support.
Publishing solution covering 3 main phases of publishers' workflows
Typical pipeline structure including gazetteers, rules, ML for tagging and disambiguation; RDF output
Main layers of the architecture, including data, analytics and UIs
Contextual authoring is the key message
A wireframe for typical contextual recommendations of data and content
Curation as a part of the editorial process. Emphasis is on quality control and adaptation of the text analytics, but also on monitoring and instance data curation.
Client annotation/curation tool
Another example of an annotation and disambiguation tool based on our APIs
Ontotext curation and monitoring module of annotation jobs
Continuous adaptation of the text analytics modules / concept extraction services.
Initial creation of gold standards and then adapting the ML models according to new examples and editorial feedback.
Main focus is on dynamic publishing of content based on metadata templates and on personalization driven by behavior and context modeling.
An example user behavior model
Behavior and context driven recommendations for a client.
Focus on methodologies for implementing a solution.
Business and Functional requirements
SME analysis
Definition of annotation types and main domain model concepts
Design of ETL jobs and annotation guidelines
Zoom in
Domain model design – Information architecture, a combination of the work of an architect and the knowledge of a subject matter expert.