A framework to develop
Knowledge Based Systems
by Vinay Bhat and Yuvraj Singh Bawa
Mentored by Srinidhi Kulkarni
Siemens Corporate Technology
Unrestricted © Siemens AG 2016
April 2016
Knowledge Based Systems
•Knowledge based systems:
• Computer programs that reason over a knowledge base to solve problems.
• Consist of two sub-systems: a knowledge base and an inference engine.
• Store knowledge in an appropriate knowledge representation language.
• A field born out of Artificial Intelligence and the need to manage big data.
Typical KBS Architecture
[Figure: components shown are Client/User Interface (Frontend APIs), Server (Backend APIs), Inference Engine, and Knowledge Base.]
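The split between a knowledge base and an inference engine can be illustrated with a minimal sketch. It assumes a rule format of "set of premises, one conclusion" and uses forward chaining; the class names, facts and rules are hypothetical and are not taken from the framework itself.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    premises: frozenset  # facts that must all hold
    conclusion: str      # fact that may then be added


@dataclass
class KnowledgeBase:
    facts: set = field(default_factory=set)
    rules: list = field(default_factory=list)


def infer(kb):
    """Inference engine: forward chaining until no new fact can be derived."""
    derived = set(kb.facts)
    changed = True
    while changed:
        changed = False
        for rule in kb.rules:
            if rule.premises <= derived and rule.conclusion not in derived:
                derived.add(rule.conclusion)
                changed = True
    return derived


# Hypothetical facts and rules, for illustration only.
kb = KnowledgeBase(
    facts={"rained_today"},
    rules=[
        Rule(frozenset({"rained_today"}), "ground_is_wet"),
        Rule(frozenset({"ground_is_wet"}), "match_delayed"),
    ],
)
print(sorted(infer(kb)))  # ['ground_is_wet', 'match_delayed', 'rained_today']
```

The knowledge base only stores data; all reasoning lives in `infer`, so either part could be replaced independently.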
Knowledge Representation
Various kinds of knowledge that can be captured
•Natural Language
• Meant for communication rather than representation. Nonetheless, it is important to capture this knowledge.
•Documents
• Such as research articles, online resources and presentations
•Data files
• CAD data, Simulation data, etc.
These need to be translated into indexable, searchable and more concrete representation forms in order to build a KBS.
Propositional logic is a precise way of representing statements
Propositional Logic
•Proposition: declarative sentence which is either true or false
• Example: “It rained today.”
•Symbols/variables:
• P, Q, S, … (atomic sentences)
•Sentences are combined by connectives:
• ∧ … and (conjunction)
• ∨ … or (disjunction)
• ⇒ … implies (implication)
• ⇔ … equivalent (double implication)
• ¬ … not (negation)
•Inference:
• to show that a proposition α follows from a KB, that is, KB |= α, we show that (KB ∧ ¬α) is unsatisfiable
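The refutation idea can be sketched by brute force over truth assignments. This is a truth-table check, not an efficient prover, and the helper names and the example knowledge base (P, and P ⇒ Q) are assumptions made for illustration.

```python
from itertools import product


def satisfiable(formula, symbols):
    """Return True if some truth assignment makes `formula` true."""
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if formula(model):
            return True
    return False


def entails(kb, alpha, symbols):
    """KB |= alpha holds exactly when (KB and not alpha) is unsatisfiable."""
    return not satisfiable(lambda m: kb(m) and not alpha(m), symbols)


def implies(a, b):
    return (not a) or b


# Hypothetical KB: P holds, and P => Q.
kb = lambda m: m["P"] and implies(m["P"], m["Q"])

print(entails(kb, lambda m: m["Q"], ["P", "Q"]))      # True
print(entails(kb, lambda m: not m["P"], ["P", "Q"]))  # False
```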
First Order Logic represents relations between objects
•Propositional logic assumes the world contains facts.
•First order logic (like natural language) assumes the world contains…
• Objects: people, houses, numbers, colors, cricket games, …
• Relations: red, round, prime, brother of, bigger than, part of, …
• Functions: father of, best friend, one more than, …
•Syntax of FOL:
• Constants: KingJohn, 2, Siemens, …
• Predicates: Brother, >, …
• Functions: Sqrt, LeftLegOf, …
• Variables: x, y, a, b
• Connectives: ∧, ∨, ⇒, ⇔, ¬
• Quantifiers: ∀, ∃ (for all, there exists)
• Everyone at Siemens is smart: ∀x At(x, Siemens) ⇒ Smart(x)
• Somebody at Siemens is procrastinating: ∃x At(x, Siemens) ∧ Procrastinate(x)
First Order Logic
A model containing five objects, two binary relations,
three unary relations (indicated by labels on the objects),
and one unary function, left-leg.
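Over a finite model, the two quantified sentences can be checked directly: ∀ becomes `all` and ∃ becomes `any`. The sketch below uses a hypothetical three-person domain; the names and predicate extensions are invented for illustration.

```python
# A tiny finite model: a domain of objects plus the extension of each predicate.
domain = {"asha", "ben", "chen"}
at = {("asha", "Siemens"), ("ben", "Siemens")}  # chen is not at Siemens
smart = {"asha", "ben"}
procrastinate = {"ben"}


def implies(a, b):
    return (not a) or b


# Everyone at Siemens is smart: for all x, At(x, Siemens) => Smart(x)
everyone_smart = all(implies((x, "Siemens") in at, x in smart) for x in domain)

# Somebody at Siemens is procrastinating: exists x, At(x, Siemens) and Procrastinate(x)
somebody_procrastinates = any(
    (x, "Siemens") in at and x in procrastinate for x in domain
)

# Pairing "for all" with "and" would wrongly require every object to be at Siemens.
too_strong = all((x, "Siemens") in at and x in smart for x in domain)

print(everyone_smart, somebody_procrastinates, too_strong)  # True True False
```

The last line shows why ∀ is normally paired with ⇒ and ∃ with ∧.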
NLP is used to translate language to first order logic
Language to meaning
Natural Language to First Order Logic
... through Natural Language Processing
Natural Language Processing Tool Kit
About NLP and NLTK
•Natural Language Processing:
• field of Computer Science, Artificial Intelligence, and Computational Linguistics
• concerned with the interactions between computers and human languages
• started in 1950; the central problems still remain:
• How do we capture meaning from human languages?
• How do we reason on them?
• How do we translate human language to something the machine can understand?
•Natural Language Toolkit (NLTK):
• provides interfaces to >50 corpora and lexical resources
• text processing libraries for classification, tokenization, tagging, parsing and semantic reasoning
• visual demos for parsers
Translator subsystem extracts deep meaning of sentences
Natural Language to First Order Logic Pipeline
Tokenization → Parts-of-speech Tagging → Syntactic Analysis → Semantic Generation → Semantic Composition
•Sentence:
• “Cyril barks”
•Tokens:
• [“Cyril”, “barks”]
•Parts of speech:
• “Cyril” => “Proper Noun”, “barks” => “Verb”
•First-order logic expression:
• bark(cyril)
Tokenization
•First step of language processing
•Break up the string into words and punctuation
•Remove whitespace, line breaks and blank lines
•Get discrete units of words in the form of a list
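A minimal regular-expression tokenizer illustrates these steps. It is a sketch only: the pattern is an assumption, and a library tokenizer such as the ones in NLTK handles many cases (contractions, abbreviations) that this one does not.

```python
import re


def tokenize(text):
    """Split text into word and punctuation tokens, discarding all whitespace."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


print(tokenize("Cyril barks.\n\nDev  runs, too!"))
# ['Cyril', 'barks', '.', 'Dev', 'runs', ',', 'too', '!']
```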
Parts-of-speech Tagging
Categorizing and Tagging Words
•Classify words into categories for language processing
• Example: Noun, Verb, Determiner, etc.
• These categories are called parts of speech (POS), word classes or lexical categories
•The process of classifying words into their POS and labeling them is performed by a POS tagger
• Taggers need to be trained
•Common tags:
•NN => Noun
•NNP => Proper Noun
•VB => Verb
•DT => Determiner (the, a, …)
•PRP => Pronoun
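To make "taggers need to be trained" concrete, here is a minimal unigram tagger sketch: it learns the most frequent tag for each word from a tagged corpus and falls back to a default tag for unseen words. The three training sentences are invented, and the tags follow the simplified list above rather than a full tagset.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Learn, for every word, the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(tokens, model, default="NN"):
    """Tag each token; unknown words receive the default tag."""
    return [(tok, model.get(tok.lower(), default)) for tok in tokens]


# Hypothetical hand-tagged training data.
training = [
    [("Cyril", "NNP"), ("barks", "VB")],
    [("the", "DT"), ("dog", "NN"), ("barks", "VB")],
    [("Dev", "NNP"), ("runs", "VB")],
]
model = train_unigram_tagger(training)

print(tag(["Dev", "barks"], model))        # [('Dev', 'NNP'), ('barks', 'VB')]
print(tag(["the", "cat", "runs"], model))  # [('the', 'DT'), ('cat', 'NN'), ('runs', 'VB')]
```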
Syntactic Analysis
Analyzing Sentence Structure
•Sentence structure is determined through a grammar
• Context-free grammars (CFGs) are the most widely used
• The rules of a grammar are called productions
•Example productions:
• S -> NP VP
• NP -> DT NN
• NP -> NNP
• VP -> IV
•Parsing: the process of identifying and constructing sentence structure
• Recursive descent parsing
• Shift-reduce parsing
•Multiple parse trees => ambiguous sentence
Syntactic Analysis
Recursive Descent Parser
Six Stages of a Recursive Descent Parser: the parser begins with a tree consisting of the node S;
at each stage it consults the grammar to find a production that can be used to enlarge the tree;
when a lexical production is encountered, its word is compared against the input;
after a complete parse has been found, the parser backtracks to look for more parses.
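The stages described above can be sketched as a small backtracking parser. It uses the productions S -> NP VP, NP -> DT NN, NP -> NNP and VP -> IV from the syntactic analysis slide; the lexical productions (the vocabulary) are assumptions added for illustration, and left-recursive grammars are not handled.

```python
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DT", "NN"], ["NNP"]],
    "VP": [["IV"]],
    # Lexical productions: hypothetical vocabulary for this sketch.
    "DT": [["the"]],
    "NN": [["dog"]],
    "NNP": [["Cyril"], ["Dev"]],
    "IV": [["barks"], ["runs"]],
}


def expand(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way `symbol` can match tokens[pos:]."""
    if symbol not in GRAMMAR:
        # Terminal: compare the word against the input.
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    # Non-terminal: try each production in turn; a failed one simply yields nothing.
    for production in GRAMMAR[symbol]:
        for children, end in expand_sequence(production, tokens, pos):
            yield (symbol, children), end


def expand_sequence(symbols, tokens, pos):
    """Match a sequence of symbols one after another."""
    if not symbols:
        yield [], pos
        return
    for tree, mid in expand(symbols[0], tokens, pos):
        for rest, end in expand_sequence(symbols[1:], tokens, mid):
            yield [tree] + rest, end


def parse(tokens):
    """Return every complete parse tree rooted at S."""
    return [tree for tree, end in expand("S", tokens, 0) if end == len(tokens)]


print(parse(["Cyril", "barks"]))
# [('S', [('NP', [('NNP', ['Cyril'])]), ('VP', [('IV', ['barks'])])])]
```

Because `parse` collects all results, a sentence with more than one tree would show up as a list with several entries, which is how ambiguity becomes visible.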
Syntactic Analysis
Excerpt from Grammar File
Semantic Generation
Analyzing the Meaning of Sentences
•Semantics refers to the meaning of an expression
•Lambda calculus is used to represent meaning
•Generation of lambda calculus terms for each individual word
•Generation of grammar terms
Semantic Generation
Lambda Calculus
•Rules for manipulating strings of symbols in the language
• term:
• variable
• term term (function application)
• (term)
• λ variable . term (function abstraction)
•Lambda calculus expressions for each word:
• “Dev” => λP.P(Dev)
• “walks” => λx.λy.walks(x,y)
• “happy” => λx.happy(x)
•β-reduction: computation in lambda calculus
• (λx. M) N → M[x := N]
• Replace all free occurrences of x in M with N
• Example: (λx.x+3) 10 => (10+3) => 13
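A small term representation makes β-reduction concrete. The sketch below models variables, abstraction and application as Python dataclasses and reduces (λP.P(Dev))(λx.happy(x)). It assumes all bound variable names are distinct, so it does not implement capture-avoiding renaming; the class and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Lam:  # function abstraction: λ param . body
    param: str
    body: object


@dataclass(frozen=True)
class App:  # function application: func arg
    func: object
    arg: object


def substitute(term, name, value):
    """Compute term[name := value]; assumes no variable capture can occur."""
    if isinstance(term, Var):
        return value if term.name == name else term
    if isinstance(term, Lam):
        if term.param == name:  # name is rebound here, so nothing below is free
            return term
        return Lam(term.param, substitute(term.body, name, value))
    return App(substitute(term.func, name, value), substitute(term.arg, name, value))


def beta_reduce(term):
    """Apply β-reduction repeatedly until no reducible application remains."""
    if isinstance(term, App):
        func = beta_reduce(term.func)
        if isinstance(func, Lam):
            return beta_reduce(substitute(func.body, func.param, term.arg))
        return App(func, beta_reduce(term.arg))
    if isinstance(term, Lam):
        return Lam(term.param, beta_reduce(term.body))
    return term


def show(term):
    if isinstance(term, Var):
        return term.name
    if isinstance(term, Lam):
        return f"λ{term.param}.{show(term.body)}"
    return f"{show(term.func)}({show(term.arg)})"


dev = Lam("P", App(Var("P"), Var("Dev")))      # “Dev”   => λP.P(Dev)
happy = Lam("x", App(Var("happy"), Var("x")))  # “happy” => λx.happy(x)
print(show(beta_reduce(App(dev, happy))))      # happy(Dev)
```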
Semantic Composition
Composing meaning of a sentence from words
•Principle of compositionality:
• the meaning of a whole is a function of the meanings of the parts and of the way they are syntactically combined.
•Semantic grammar rules define how word semantics combine:
• Example sentence: Dev runs
• Grammar rule: S[SEM = <app(?subj,?vp)>] -> NP[SEM=?subj] VP[SEM=?vp]
• VP semantics: λx.run(x)
• NP semantics: λP.P(Dev)
• Sentence semantics: (λP.P(Dev))(λx.run(x)) => (λx.run(x))(Dev) => run(Dev)
•Using the parse tree and the grammar file, the individual word semantics β-reduce to the final sentence semantics
Excerpt from grammar file
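The composition step for an NP VP sentence can be imitated with ordinary Python functions standing in for lambda terms. This is a sketch of the idea behind the rule S -> NP VP with SEM = app(?subj, ?vp), not the grammar file itself; the lexicon and the string-based output format are assumptions.

```python
# Word semantics as Python functions (standing in for lambda terms).
WORD_SEM = {
    "Dev": lambda P: P("Dev"),      # NP: λP.P(Dev)
    "Cyril": lambda P: P("cyril"),  # NP: λP.P(cyril)
    "runs": lambda x: f"run({x})",  # VP: λx.run(x)
    "barks": lambda x: f"bark({x})",
}


def compose_sentence(subject, verb):
    """Mirror app(?subj, ?vp): apply the subject semantics to the VP semantics."""
    np_sem = WORD_SEM[subject]
    vp_sem = WORD_SEM[verb]
    return np_sem(vp_sem)


print(compose_sentence("Dev", "runs"))     # run(Dev)
print(compose_sentence("Cyril", "barks"))  # bark(cyril)
```

Calling `np_sem(vp_sem)` performs both β-reduction steps at once: the NP receives the VP as its argument P and then applies it to the individual.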
The key challenge in parsing PDFs is their unstructured nature
•The other kind of knowledge that we have attempted to capture comes from documents, such as scientific literature, which are mostly in PDF format.
•The main challenge: processing PDF documents
•Unlike web pages or XML-based documents, which follow a logical representation, PDFs follow a visual representation scheme
•Processing such unstructured data directly is very difficult, so PDFs need to be converted to an XML-based file first.
Knowledge capture from PDF Documents
PDFs are converted to HTML and then parsed.
•A PDF to HTML converter needs to be implemented.
•Available technologies:
• pdf2htmlEX (open source, implemented in C++, Python and JavaScript)
• PDFMiner (open source, implemented in Python)
• However, these are not perfect and often produce unusable results.
•The next step is to parse the HTML document.
• We have written a JavaScript library for this, Beautiful Soup JS, based on the Python library of the same name.
•Advantages of Beautiful Soup JS:
• Client-side parsing
• Lightweight (~10 kB, compared with 200 kB for the Python Beautiful Soup)
Proposed Solution
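The HTML parsing step can be illustrated with Python's standard-library `html.parser`. This is not the Beautiful Soup JS API, which is not shown in the slides; it is a stand-in sketch of the same idea, and the set of block tags and the sample markup are assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, grouped by the block element that contains it."""

    BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._current.append(data.strip())

    def _flush(self):
        if self._current:
            self.blocks.append(" ".join(self._current))
            self._current = []


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep any trailing text outside a block element
    return parser.blocks


sample = (
    "<div><h1>Knowledge capture</h1>"
    "<p>PDFs follow a <b>visual</b> representation scheme</p>"
    "<style>p{}</style></div>"
)
print(extract_text(sample))
# ['Knowledge capture', 'PDFs follow a visual representation scheme']
```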
We have developed APIs for extracting text and images from PDFs
•The Poppler package in Python is used for:
• Extracting images from PDFs
• Extracting raw text from PDFs
•Beautiful Soup JS is used for:
• Extracting data from online articles
• A special Wikipedia parsing API that demonstrates how this can be used for specific websites
Currently Implemented APIs:
An overview of the proposed KBS Framework
The framework is designed keeping in mind four levels of interaction:
•Developer: builds and maintains the various APIs
•Application Builder: builds a KBS for a specific application using the available APIs
•Admin: feeds data and manages access rights of the built KBS software
•KBS User: queries the data on demand and, if allowed, feeds data to the KBS
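One way to picture the four levels of interaction is as roles with permitted actions. The sketch below is purely hypothetical: the action names, the permission table and the `granted` mechanism for the "if allowed" case are assumptions, not part of the framework as presented.

```python
from enum import Enum, auto


class Action(Enum):
    MAINTAIN_APIS = auto()
    BUILD_APPLICATION = auto()
    FEED_DATA = auto()
    MANAGE_ACCESS = auto()
    QUERY = auto()


class Role(Enum):
    DEVELOPER = auto()
    APPLICATION_BUILDER = auto()
    ADMIN = auto()
    KBS_USER = auto()


# Hypothetical default permissions derived from the role descriptions.
PERMISSIONS = {
    Role.DEVELOPER: {Action.MAINTAIN_APIS},
    Role.APPLICATION_BUILDER: {Action.BUILD_APPLICATION},
    Role.ADMIN: {Action.FEED_DATA, Action.MANAGE_ACCESS},
    Role.KBS_USER: {Action.QUERY},
}


def is_allowed(role, action, granted=frozenset()):
    """`granted` holds extra actions that an Admin has allowed for this user."""
    return action in PERMISSIONS[role] or action in granted


print(is_allowed(Role.KBS_USER, Action.QUERY))                             # True
print(is_allowed(Role.KBS_USER, Action.FEED_DATA))                         # False
print(is_allowed(Role.KBS_USER, Action.FEED_DATA, {Action.FEED_DATA}))     # True
```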
Future Work and Scope
Developing APIs to:
• Obtain meaningful HTML output from PDFs
• Process CAD data, to design knowledge-based engineering systems for design automation
• Process other documents, such as presentations and spreadsheets
• Process speech data
Integration of document information retrieval and NLP:
• Extract information from documents and transform it into First Order Logic
• Integrate the document parsing and translator subsystems
• Broaden the approach to NLP
• Survey more accurate methods, such as neural networks and other machine learning techniques
Thank You
