Knowledge_Based_Systems_Siemens
- 1. A framework to develop
Knowledge Based Systems
by Vinay Bhat and Yuvraj Singh Bawa
Mentored by Srinidhi Kulkarni
Siemens Corporate Technology
Restricted © Siemens AG 2016
- 2. Unrestricted © Siemens AG 2016
April 2016, Page 2, Corporate Technology
Knowledge Based Systems
•Knowledge-based systems (KBS):
• computer programs that reason over a knowledge base to solve problems
• consist of two sub-systems: a knowledge base and an inference engine
• store knowledge in an appropriate knowledge representation language
• a field born out of Artificial Intelligence and the need to manage big data
[Figure: KBS architecture with four components: Client/User Interface (Frontend APIs), Server (Backend APIs), Inference Engine, Knowledge Base]
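To make the split between knowledge base and inference engine concrete, here is a minimal sketch in Python. It assumes the knowledge base is a set of facts plus if-then rules and uses forward chaining as the inference strategy; the slides do not say which strategy the framework uses, and the fact and rule names are invented for illustration.

```python
def forward_chain(facts, rules):
    """Derive every fact reachable from `facts` using if-then `rules`.

    Each rule is a (premises, conclusion) pair; a rule fires once all of
    its premises are known.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known


# Knowledge base: one fact and three rules (all hypothetical).
knowledge_base = {"rained_today"}
rules = [
    (("rained_today",), "ground_wet"),
    (("ground_wet",), "slippery"),
    (("snowed_today",), "roads_closed"),
]

# Inference engine: apply the rules until nothing new can be derived.
derived = forward_chain(knowledge_base, rules)
print(sorted(derived))
```

The knowledge base is plain data, so it can grow without touching the engine.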
- 4.
Knowledge Representation
Various kinds of knowledge that can be captured
•Natural Language
• Meant for communication rather than representation. Nonetheless, it is important to capture this knowledge.
•Documents
• Such as research articles, online resources and presentations
•Data files
• CAD data, Simulation data, etc.
These need to be translated into indexable, searchable and more concrete representation forms in order to build a KBS.
- 5.
Propositional Logic is a precise way of representing statements
Propositional Logic
•Proposition: declarative sentence which is either true or false
• Example: “It rained today.”
•Symbols/variables:
• P, Q, S, … (atomic sentences)
•Sentences are combined by connectives:
• ∧ … and Conjunction
• ∨ … or Disjunction
• ⇒ … implies Implication
• ⇔ … equivalence Double Implication
• ¬ … not Negation
•Inference:
• to show that a proposition α follows from a KB, that is, KB ⊨ α, we show that (KB ∧ ¬α) is unsatisfiable
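The refutation idea can be sketched with a brute-force truth table: KB ⊨ α holds exactly when no assignment satisfies KB ∧ ¬α. The code below is an illustration only. The symbols P and Q and the example KB are assumptions, and enumerating assignments is feasible only for a handful of symbols.

```python
from itertools import product


def is_satisfiable(formula, symbols):
    """Return True if some truth assignment makes `formula` true."""
    for values in product([True, False], repeat=len(symbols)):
        if formula(dict(zip(symbols, values))):
            return True
    return False


def entails(kb, alpha, symbols):
    """KB entails alpha exactly when (KB and not alpha) is unsatisfiable."""
    return not is_satisfiable(lambda m: kb(m) and not alpha(m), symbols)


symbols = ["P", "Q"]


def kb(model):
    # KB: P holds, and P implies Q (written as: not P or Q).
    return model["P"] and ((not model["P"]) or model["Q"])


follows_q = entails(kb, lambda m: m["Q"], symbols)
follows_not_p = entails(kb, lambda m: not m["P"], symbols)
print(follows_q, follows_not_p)
```

Here Q follows from the KB (modus ponens), while ¬P does not.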
- 6.
First Order Logic represents relations between objects
•Propositional logic assumes the world contains facts.
•First order logic (like natural language) assumes the world contains…
• Objects: people, houses, numbers, colors, cricket games, …
• Relations: red, round, prime, brother of, bigger than, part of, …
• Functions: father of, best friend, one more than, …
•Syntax of FOL:
• Constants: KingJohn, 2, Siemens, …
• Predicates: Brother, >, …
• Functions: Sqrt, LeftLegOf, …
• Variables: x, y, a, b
• Connectives: ¬, ∧, ∨, ⇒, ⇔
• Quantifiers: ∀, ∃ (for all, there exists)
• Everyone at Siemens is smart: ∀x At(x, Siemens) ⇒ Smart(x)
• Somebody at Siemens is procrastinating: ∃x At(x, Siemens) ∧ Procrastinate(x)
First Order Logic
[Figure: a model containing five objects, two binary relations, three unary relations (indicated by labels on the objects), and one unary function, left-leg]
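Over a finite model, the two quantified sentences from this slide can be checked directly: ∀ becomes `all`, ∃ becomes `any`, and relations become sets of tuples. The sketch below evaluates "∀x At(x, Siemens) ⇒ Smart(x)" and "∃x At(x, Siemens) ∧ Procrastinate(x)". The three people in the domain are invented for illustration.

```python
# A tiny finite model; every individual below is hypothetical.
domain = {"asha", "ben", "chen"}
at = {("asha", "Siemens"), ("ben", "Siemens")}  # binary relation At
smart = {"asha", "ben", "chen"}                  # unary relation Smart
procrastinate = {"ben"}                          # unary relation Procrastinate


def implies(p, q):
    """Material implication: p implies q."""
    return (not p) or q


# For all x: At(x, Siemens) implies Smart(x)
everyone_smart = all(
    implies((x, "Siemens") in at, x in smart) for x in domain
)

# There exists x: At(x, Siemens) and Procrastinate(x)
somebody_procrastinates = any(
    (x, "Siemens") in at and x in procrastinate for x in domain
)

print(everyone_smart, somebody_procrastinates)
```

Note that the universal sentence pairs with ⇒ and the existential with ∧; swapping them changes the meaning.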
- 7.
NLP is used to translate language to first order logic
Language to meaning
Natural Language to First Order Logic
... through Natural Language Processing
- 8.
Natural Language Processing Tool Kit
About NLP and NLTK
•Natural Language Processing:
• field of Computer Science, Artificial Intelligence, and Computational Linguistics
• concerned with the interactions between computers and human languages
• started in 1950, yet the core problems remain open:
• How do we capture meaning from human languages?
• How do we reason on them?
• How do we translate human language to something the machine can understand?
•Natural Language Toolkit (NLTK):
• provides interfaces to more than 50 corpora and lexical resources
• text processing libraries for classification, tokenization, tagging, parsing and semantic reasoning
• visual demos for parsers
- 9.
Translator subsystem extracts deep meaning of sentences
Natural Language to First Order Logic Pipeline
Tokenization → Parts-of-speech Tagging → Syntactic Analysis → Semantic Generation → Semantic Composition
• Sentence:
•“Cyril barks”
•Tokens:
•[“Cyril”, “barks”]
•Parts of speech:
•“Cyril” => “Proper Noun”, “barks” => “Verb”
•First-order logic expression:
•bark(cyril)
- 10.
Tokenization
•First step of language processing
•Break up the string into words and punctuation
•Remove whitespace, line breaks and blank lines
•Get discrete units of words in the form of a list
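A minimal tokenizer along these lines can be written with one regular expression. This is a sketch using only Python's standard library, not the tokenizer used in the project; the pattern is an assumption that treats each punctuation mark as its own token.

```python
import re


def tokenize(text):
    """Split text into word and punctuation tokens, discarding whitespace."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = tokenize("Cyril barks.\n\nDev runs,  too!")
print(tokens)
```

Whitespace, line breaks and blank lines never match either alternative, so they drop out of the resulting list.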
- 11.
Parts-of-speech Tagging
Categorizing and Tagging Words
•Classify words into categories for language processing
•Example: Noun, Verb, Determiner, etc.
•These categories are called parts-of-speech (POS), also word classes or lexical categories
•Process of classifying words into their POS and labeling them: POS tagging, done by a POS tagger
•Taggers need to be trained
•Common tags:
•NN => Noun
•NNP => Proper Noun
•VB => Verb
•DT => Determiner (the, a, …)
•PRP => Pronoun
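"Taggers need to be trained" can be illustrated with a unigram tagger that learns each word's most frequent tag from hand-tagged sentences. This is a simplified sketch in plain Python, not the tagger used in the project. The training sentences are invented, and the fallback tag NN for unseen words is an assumption.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences, default_tag="NN"):
    """Learn each word's most frequent tag from hand-tagged sentences."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    best = {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

    def tag(tokens):
        # Unseen words fall back to the default tag.
        return [(tok, best.get(tok.lower(), default_tag)) for tok in tokens]

    return tag


# Toy training data, invented for illustration.
training = [
    [("Cyril", "NNP"), ("barks", "VB")],
    [("the", "DT"), ("dog", "NN"), ("barks", "VB")],
    [("Dev", "NNP"), ("runs", "VB")],
]
tagger = train_unigram_tagger(training)
tagged = tagger(["The", "dog", "runs"])
print(tagged)
```

A tagger this simple ignores context, so it cannot tell a noun use of a word from a verb use; real taggers add surrounding-word features.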
- 12.
Syntactic Analysis
Analyzing Sentence Structure
•Sentence structure is determined through a grammar
•Context-free grammars (CFGs) are the most widely used
•Grammar rules are called productions
•Example:
•S -> NP VP
•NP -> DT NN
•NP -> NNP
•VP -> IV
•Process of identifying and constructing sentence structure: parsing
•Recursive descent parsing
•Shift-reduce parsing
•Multiple parse trees => ambiguous sentence
- 13.
Syntactic Analysis
Recursive Descent Parser
Six Stages of a Recursive Descent Parser: the parser begins with a tree consisting of the node S;
at each stage it consults the grammar to find a production that can be used to enlarge the tree;
when a lexical production is encountered, its word is compared against the input;
after a complete parse has been found, the parser backtracks to look for more parses.
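The stages described here can be sketched as a small backtracking parser. The code uses the productions S -> NP VP, NP -> DT NN, NP -> NNP and VP -> IV. The lexical productions (which words belong to DT, NN, NNP and IV) are assumptions added so the example can run, and trees are represented as nested tuples rather than any library's tree type.

```python
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DT", "NN"], ["NNP"]],
    "VP": [["IV"]],
    # Lexical productions: assumed for illustration only.
    "DT": [["the"]],
    "NN": [["dog"]],
    "NNP": [["Cyril"]],
    "IV": [["barks"]],
}


def expand(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way `symbol` can match from `pos`."""
    if symbol not in GRAMMAR:  # a word: compare it against the input
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for production in GRAMMAR[symbol]:  # try each production in turn
        for children, end in expand_sequence(production, tokens, pos):
            yield (symbol, children), end


def expand_sequence(symbols, tokens, pos):
    """Yield (trees, next_pos) for matching `symbols` left to right."""
    if not symbols:
        yield [], pos
        return
    for first, mid in expand(symbols[0], tokens, pos):
        for rest, end in expand_sequence(symbols[1:], tokens, mid):
            yield [first] + rest, end


def parse(tokens):
    """Return every complete parse tree rooted at S."""
    return [tree for tree, end in expand("S", tokens, 0) if end == len(tokens)]


trees = parse(["the", "dog", "barks"])
print(trees)
```

Because the functions are generators, asking for the next result resumes the search where it left off, which is the backtracking step. A grammar with left-recursive productions would make this simple parser loop forever.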
- 14.
Syntactic Analysis
Excerpt from Grammar File
- 15.
Semantic Generation
Analyzing the Meaning of Sentences
•Semantics refers to the meaning of an expression
•Lambda calculus is used to represent meaning
•Generation of “lambda calculus” terms for each individual word
•Generation of grammar terms
- 16.
Semantic Generation
Lambda Calculus
•Rules for manipulating strings of symbols in the language
• term:
• variable
• term term (function application)
• (term)
• λ variable . term (function abstraction)
•Lambda calculus expressions for each word:
•“Dev” => λP.P(Dev)
•“walks” => λxy.walks(x,y)
•“happy” => λx.happy(x)
•β-reduction: computation in lambda calculus
• (λx. M) N → M[x := N]
•Replace all x’s in M with N
•Example:
•(λx.x+3) 10 => (10+3) => 13
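Python's own `lambda` gives a quick feel for abstraction and application: defining a lambda is function abstraction, and calling it carries out the β-reduction. The sketch below mirrors the word meanings from this slide, with predicates rendered as strings. The two-argument "walks" is written in curried form, and the argument "home" is invented for illustration.

```python
# Function abstraction: the term "lambda x . x + 3".
add_three = lambda x: x + 3
# Function application: applying it to 10 performs the beta-reduction.
result = add_three(10)

# Word meanings, with predicates rendered as strings.
dev = lambda P: P("Dev")                       # "Dev": takes a property P, yields P(Dev)
happy = lambda x: f"happy({x})"                # "happy": one-place predicate
walks = lambda x: lambda y: f"walks({x},{y})"  # "walks": two-place predicate, curried

print(result)
print(dev(happy))
print(walks("Dev")("home"))
```

Python evaluates eagerly, so this only illustrates the substitution idea; it does not build or simplify symbolic lambda terms.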
- 17.
Semantic Composition
Composing meaning of a sentence from words
•Principle of compositionality:
• the meaning of a whole is a function of the meanings of the parts and of the way they are syntactically combined.
•Semantic grammar rules define how word semantics combine:
•Example:
Sentence: Dev runs
S[SEM = <app(?subj,?vp)>] -> NP[SEM=?subj] VP[SEM=?vp]
VP semantics: λx.run(x)
NP semantics: λP.P(Dev)
Hence, the sentence semantics are:
(λP.P(Dev))(λx.run(x)) =>
(λx.run(x))(Dev) =>
run(Dev)
Final sentence semantics: run(Dev)
•Using the parse tree and the grammar file, the individual word semantics β-reduce to the final sentence semantics
Excerpt from grammar file
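The "Dev runs" derivation can be replayed in a few lines of Python. This is a sketch of the single rule S -> NP VP with semantics app(subj, vp), not the project's grammar file. The lexicon, the tuple-based parse tree and the helper names are assumptions made for illustration.

```python
# Lexical semantics: an assumed lexicon modelled on the "Dev runs" example.
LEXICON = {
    "Dev": lambda P: P("Dev"),        # NP: takes a property P, yields P(Dev)
    "Cyril": lambda P: P("Cyril"),    # NP: takes a property P, yields P(Cyril)
    "runs": lambda x: f"run({x})",    # VP: one-place predicate run
    "barks": lambda x: f"bark({x})",  # VP: one-place predicate bark
}


def app(function, argument):
    """Function application; the Python call performs the beta-reduction."""
    return function(argument)


def sentence_semantics(tree):
    """Apply the rule S -> NP VP with sentence semantics app(subj, vp)."""
    label, (np, vp) = tree
    if label != "S":
        raise ValueError("expected a tree rooted at S")
    subj_sem = LEXICON[np[1]]
    vp_sem = LEXICON[vp[1]]
    return app(subj_sem, vp_sem)


parse_tree = ("S", (("NP", "Dev"), ("VP", "runs")))
meaning = sentence_semantics(parse_tree)
print(meaning)
```

The noun phrase is the function and the verb phrase its argument, exactly as in the reduction (NP)(VP) worked through on the slide.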
- 18.
The key challenge in parsing PDFs is their unstructured nature
•The other kind of knowledge we have attempted to capture comes from documents, such as scientific literature, which are mostly in PDF format.
•The main challenge: processing PDF documents
•Unlike web pages or XML-based documents, which follow a logical representation, PDFs follow a visual representation scheme
•Processing such unstructured data directly is very difficult, so PDFs need to be converted to an XML-based file first.
Knowledge capture from PDF Documents
- 19.
PDFs are converted to HTML and then parsed.
•A PDF to HTML converter needs to be implemented.
•Available technologies:
• pdf2htmlEX (open source, implemented in C++, Python and JavaScript)
• PDFMiner (open source, implemented in Python)
• However, neither is perfect, and both often produce garbled results.
• The next step is to parse the HTML document.
• We have written a JavaScript library for this, Beautiful Soup JS, based on the Python library of the same name.
• Advantages of Beautiful Soup JS:
• Client-side parsing
• Lightweight (~10 kB, compared with ~200 kB for the Python Beautiful Soup)
Proposed Solution
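The HTML parsing step can be illustrated without any third-party library. The sketch below is a stand-in written with Python's standard `html.parser`; it is neither Beautiful Soup JS nor its API. It collects the visible text of a converted document, and the sample HTML is invented.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text fragments of an HTML string, in order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Abstract</h1><p>PDFs follow a <b>visual</b> layout.</p></body></html>"
)
chunks = extract_text(sample)
print(chunks)
```

Text split by inline tags arrives as separate fragments, so a real pipeline would still need to merge fragments back into sentences before NLP.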
- 20.
We have developed APIs for extracting text and images from PDFs
•The Poppler package in Python is used for:
• Extracting images from PDFs
• Extracting raw text from PDFs
• Beautiful Soup JS is used for:
• Extracting data from online articles
• A dedicated Wikipedia parsing API that demonstrates how this can be used for specific websites
Currently Implemented APIs:
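The slides do not say which Python binding to Poppler is used. As one possible approach, the sketch below shells out to Poppler's command-line tools `pdftotext` and `pdfimages`. The helper names and the file paths are hypothetical, and the commands run only if the tools are installed.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, txt_path):
    """Command line for Poppler's pdftotext, keeping the page layout."""
    return ["pdftotext", "-layout", pdf_path, txt_path]


def pdfimages_command(pdf_path, output_prefix):
    """Command line for Poppler's pdfimages, writing images as PNG files."""
    return ["pdfimages", "-png", pdf_path, output_prefix]


def run_if_available(command):
    """Run the command when the tool is installed; return True on success."""
    if shutil.which(command[0]) is None:
        return False
    return subprocess.run(command, check=False).returncode == 0


# "paper.pdf" and the output names are placeholder paths.
text_cmd = pdftotext_command("paper.pdf", "paper.txt")
image_cmd = pdfimages_command("paper.pdf", "paper-img")
print(" ".join(text_cmd))
print(" ".join(image_cmd))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.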
- 21.
An overview of the proposed KBS Framework
The framework is designed keeping in mind four levels of interaction:
• Developer: builds and maintains the various APIs
• Application Builder: builds a KBS for a specific application using the available APIs
• Admin: feeds data and manages access rights of the built KBS software
• KBS User: queries the data on demand and, if allowed, feeds data to the KBS
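One way to picture the four interaction levels is as a role-to-permission table. This is a hypothetical sketch, not the framework's access-control design. The action names, the role keys and the `granted` mechanism (modelling the "if allowed" case for KBS users) are all assumptions.

```python
from enum import Enum, auto


class Action(Enum):
    MAINTAIN_APIS = auto()
    BUILD_KBS = auto()
    FEED_DATA = auto()
    MANAGE_ACCESS = auto()
    QUERY = auto()


# Default permissions per role, read off the four role descriptions.
ROLE_ACTIONS = {
    "developer": {Action.MAINTAIN_APIS},
    "application_builder": {Action.BUILD_KBS},
    "admin": {Action.FEED_DATA, Action.MANAGE_ACCESS},
    "kbs_user": {Action.QUERY},
}


def is_allowed(role, action, granted=frozenset()):
    """Check a role's default actions plus any extra actions granted to it."""
    return action in ROLE_ACTIONS.get(role, set()) or action in granted


user_can_query = is_allowed("kbs_user", Action.QUERY)
user_can_feed = is_allowed("kbs_user", Action.FEED_DATA)
user_can_feed_when_granted = is_allowed(
    "kbs_user", Action.FEED_DATA, granted={Action.FEED_DATA}
)
print(user_can_query, user_can_feed, user_can_feed_when_granted)
```

Keeping the extra grants separate from the defaults lets an admin extend a user's rights without redefining the role.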
- 22.
Future Work and Scope
Developing APIs to
• Obtain meaningful HTML output from PDFs
• Process CAD data, to design knowledge-based engineering systems for design automation
• Process other documents, such as presentations and spreadsheets
• Process speech data
Integration of document information retrieval and NLP
• Extract information from documents and transform it into First Order Logic
• Integrate the document parsing and translator subsystems
• Broaden the approach to NLP
• Survey more accurate methods, such as neural networks and machine learning