Adam Bartusiak and Jörg Lässig | Semantic Processing for the Conversion of Unstructured Documents into Structured Information in the Enterprise Context
1. Enterprise Application Development Group
University of Applied Sciences Zittau/Görlitz
The NXTM Project
Development of a technology for live analysis of data
streams with regard to semantics and cross-linked data
structures
Adam Bartusiak M.Sc.
University of Applied Sciences Zittau/Görlitz
January 7, 2015
SEMANTiCS’16 - 13.09.2016
Adam Bartusiak M.Sc.
University of Applied Sciences Zittau/Görlitz
Semantic Processing for the Conversion of
Unstructured Documents into Structured
Information in the Enterprise Context
The NXTM research project
Agenda
• Motivation
• The NXTM Project
• Data analysis
• Search Engine
• Representation Layer
• Use case
Motivation
• unstructured data overload (80-90% of digital data)
• unstructured data is intended primarily for human consumption
• it holds useful knowledge that can be utilized for:
  • trend analytics
  • decision support
  • problem solving
  • discovering new facts and relations
• it can improve knowledge management within the enterprise
• it helps SMEs gain a sustainable competitive advantage in the market
The NXTM Project
• cooperation project between HSZG and an IT company from Dresden
• lifetime: January 2015 - October 2016
Goal:
Improving SMEs’ processes for extracting valuable business information from unstructured data (UD)
• extraction of structured data from unstructured data from multiple sources:
  • emails and text messages
  • MS Office and PDF documents
  • XML and HTML files
• dynamic recognition and representation of linked information in documents
• flexible and intuitive graphical user interface enabling easy access to the
analyzed data
Data analysis
[Figure: processing architecture. Components: Data Input Interface; NXTM Data and Text Analysis Engine (Metadata Analysis, Text Extraction, Segmentation, Morphology, Semantic Analysis, Similarity Analysis); Data Persistence Layer; DB Mapper; Clustering Engine; Linked Open Data; Knowledge Integrator]
1. import of documents as Java objects from the input pipeline
2. language identification, MIME type and metadata analysis
3. natural-language processing in chained analysis engines, annotating semantic information
4. similarity calculation and document clustering
5. storing extracted data in the DB, updating the search index
6. mapping annotated entities to Linked Open Data (LOD) resources
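Step 3 (chained analysis engines) can be pictured with a minimal sketch. It is written in Python for brevity, although the slides describe a Java system. All names here (Document, extract_text, segment, annotate_entities, run_pipeline) are hypothetical, and the gazetteer lookup is only a toy stand-in for the semantic analysis.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for a document object from the input pipeline."""
    doc_id: str
    raw: str
    annotations: Dict[str, object] = field(default_factory=dict)


# An analysis engine reads the document and adds its own annotations.
Engine = Callable[[Document], None]


def extract_text(doc: Document) -> None:
    # Placeholder: real text extraction would depend on the MIME type.
    doc.annotations["text"] = doc.raw.strip()


def segment(doc: Document) -> None:
    # Naive whitespace tokenization, for illustration only.
    doc.annotations["tokens"] = doc.annotations["text"].split()


def annotate_entities(doc: Document) -> None:
    # Toy gazetteer lookup standing in for semantic analysis.
    gazetteer = {"NY": "PLACE"}
    doc.annotations["entities"] = [
        (tok, gazetteer[tok]) for tok in doc.annotations["tokens"] if tok in gazetteer
    ]


def run_pipeline(doc: Document, engines: List[Engine]) -> Document:
    # Engines run in order; later engines rely on earlier annotations.
    for engine in engines:
        engine(doc)
    return doc


doc = run_pipeline(
    Document("1", " Lorem NY ipsum "),
    [extract_text, segment, annotate_entities],
)
print(doc.annotations["entities"])
```

Each engine only adds annotations, so an engine can be swapped or reordered without changing the others, as long as the annotations it depends on are produced earlier in the chain.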
NXTM Item
• ID
• Type (DOC, ENT)
• Attribute []
• …
Attribute
• Predicate
• Value (NXTM_Item_ID; String)
• Provenance (NXTM_Item_ID)
• Confidence
• Access policy
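The item/attribute model above could be sketched as two small record types. This is an illustration in Python, not the project's Java classes. The field names follow the slide, while the value_is_item_id flag, the [0, 1] confidence range and the string-valued access policy are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ItemType(Enum):
    DOC = "DOC"
    ENT = "ENT"


@dataclass
class Attribute:
    predicate: str
    value: str                        # another item's ID or a plain string
    value_is_item_id: bool = False    # assumption: flag to tell the two apart
    provenance: Optional[str] = None  # ID of the item the attribute came from
    confidence: float = 1.0           # assumption: score in [0, 1]
    access_policy: str = "public"     # assumption: policy encoded as a label


@dataclass
class NxtmItem:
    item_id: str
    item_type: ItemType
    attributes: List[Attribute] = field(default_factory=list)


# Example: a place entity whose name was found in document 1.
place = NxtmItem(
    "301",
    ItemType.ENT,
    [Attribute("name", "NY", provenance="1", confidence=0.9)],
)
print(place)
```

Keeping provenance and confidence on each attribute, rather than on the item, lets one entity collect facts from several documents with different reliability.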
Search Engine
[Figure: search architecture. Labels: Search query…; NXTM Search Layer; Semantic Search Machine; Data Persistence Layer; Clustering Engine; Results…]
Search index fields:
Field      Value
ID         NXTM_Item_ID
Content    LuceneAnalyzer
Semantic   SIREnAnalyzer
• querying the DB directly to retrieve the analyzed data is an inefficient way of searching for information
• a semantic search machine can search hierarchical data effectively
• the search engine is still a subject of research
Representation Layer
• search results are represented as an interactive graph with
nodes and edges
• real time browsing of the graph enables the user to discover
other relevant sources of information and their dependencies
• D3.js JavaScript library (d3js.org)
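A graph view like this needs the search results in a nodes-and-links shape. The sketch below is an assumption about how that hand-over could look, not the project's code: it takes (source, target, distance) triples and emits the nodes/links JSON layout commonly fed to D3 force layouts. The function name to_graph and the distance values are made up for illustration.

```python
import json
from typing import Dict, List, Tuple


def to_graph(triples: List[Tuple[str, str, float]]) -> Dict[str, list]:
    """Turn (source, target, distance) triples into a nodes/links structure."""
    node_ids: List[str] = []
    for source, target, _ in triples:
        for node_id in (source, target):
            # Keep first-seen order and avoid duplicate nodes.
            if node_id not in node_ids:
                node_ids.append(node_id)
    return {
        "nodes": [{"id": node_id} for node_id in node_ids],
        "links": [
            {"source": s, "target": t, "distance": d} for s, t, d in triples
        ],
    }


# Illustrative distances only; they are not results from the project.
graph = to_graph([("DOC#1", "DOC#2", 0.4), ("DOC#1", "ENT#301", 0.2)])
print(json.dumps(graph, indent=2))
```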
[Figure: NXTM Representation Layer with Standalone Frontend and Plugins & Apps. Example nodes:]
Document
• Abstract: Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor…
• Updated: 03.01.2003
Entity
• Type: Person
• Name: John Smith
• Author of: XYZ
• Title: XYZ
Use case
Document #1:
Lorem ipsum dolor sit amet, consetetur NY elitr, sed diam nonumy eirmod tempor invidunt ut labore et NY dolore magna aliquyam erat, NY sed diam voluptua. At vero eos et accusam et justo duo dolores NY et ea rebum. Stet
Document #2:
ipsum dolor sit amet. Lorem NY ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et
Entity
• ID #301
• type PLACE
• name NY (#1)
• name NY (#2)
Metadata
• createdIn NY
NXTM System
NXTM Item
• ID #1
• Type DOC
• Attribute [] (Metadata)
NXTM Item
• ID #2
• Type DOC
• Attribute [] (Metadata)
NXTM Item
• ID #301
• Type ENT
• Attribute [] (Metadata)
NXTM DB
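The use case, where "NY" in documents #1 and #2 ends up as one entity item #301, can be sketched as follows. This is a simplified illustration in Python: matching mentions by identical type and name is an assumption (real entity resolution is harder), and the consolidate function and the dictionary layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (document ID, surface form, entity type) as an annotator might emit them.
mentions: List[Tuple[str, str, str]] = [
    ("1", "NY", "PLACE"),
    ("2", "NY", "PLACE"),
]


def consolidate(
    mentions: List[Tuple[str, str, str]], first_entity_id: int = 301
) -> Dict[str, dict]:
    """Group mentions with the same (type, name) into one entity item."""
    grouped = defaultdict(list)
    for doc_id, name, ent_type in mentions:
        grouped[(ent_type, name)].append(doc_id)

    items: Dict[str, dict] = {}
    for offset, ((ent_type, name), doc_ids) in enumerate(grouped.items()):
        ent_id = str(first_entity_id + offset)
        items[ent_id] = {
            "type": "ENT",
            "attributes": [
                {"predicate": "type", "value": ent_type},
                # One name attribute per source document keeps the provenance.
                *[
                    {"predicate": "name", "value": name, "provenance": d}
                    for d in doc_ids
                ],
            ],
        }
    return items


entities = consolidate(mentions)
print(entities)
```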
Use case cont.
Query: New York
NXTM Results
ResultItem
• NXTM_ITEM_ID #1
• Score
• Attribute []
ResultItem
• NXTM_ITEM_ID #2
• Score
• Attribute []
ResultItem
• NXTM_ITEM_ID #301
• Score
• Attribute []
Result Triples
Source; Target; Distance
ResultItem#1; ResultItem#2; DOC-DOC
ResultItem#1; ResultItem#3; DOC-ENT
ResultItem#2; ResultItem#3; DOC-ENT
[Figure: result graph with nodes ENT #301, DOC #1 and DOC #2]
• DOC-DOC -> f(TF*IDF Similarity, Lucene score)
• DOC-ENT -> f(Confidence score, Lucene score)
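The slides do not define f, so the sketch below only illustrates one possible shape. It computes a plain TF*IDF cosine similarity between tokenized documents and combines it with a search score through a weighted mean. The weighted mean, the alpha parameter and the assumption that both inputs are already normalized to [0, 1] (raw Lucene scores are not) are hypothetical choices.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Plain TF*IDF weights with idf = log(N / df); one common variant."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def distance(similarity: float, search_score: float, alpha: float = 0.5) -> float:
    """Hypothetical f: 1 minus a weighted mean of two scores in [0, 1]."""
    return 1.0 - (alpha * similarity + (1.0 - alpha) * search_score)


docs = [
    ["lorem", "ny", "ipsum"],
    ["ny", "dolor", "ipsum"],
    ["sit", "amet"],
]
vectors = tfidf_vectors(docs)
doc_doc_similarity = cosine(vectors[0], vectors[1])
print(distance(doc_doc_similarity, search_score=0.8))
```

For DOC-ENT edges the same distance function could take the attribute's confidence score in place of the TF*IDF similarity.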
[Figure: example result graph. Node labels: Adam Bartusiak …, Person, DOC#45, Metadata, Keywords]
Questions
Partners/Cooperations
a.bartusiak@hszg.de | ead.hszg.de