The 6th
IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications
15-17 September 2011, Prague, Czech Republic
Selection and Aggregation of Sentences
in the Knowledge Formation Process
M.S. Shibut, V.S. Yakovishin
The Academy of Public Administration under the aegis of the President of the Republic of Belarus,
17, Moskovskaya Str., 220007, Minsk, Republic of Belarus, m_shibut@pac.by, http://pac.by/en
Abstract—The presented method is based on the use of a
special formal language. In this language, all sentence
structures are expressed as sets of syntactic elements
(syntagmes), which allows us to reduce the semantic
identification of sentences (their selection and aggregation)
to the set-theoretical relation of inclusion. Input text
sentences are first transformed into set-theoretical form;
the resulting formal sentence structures are then selected
and united into growing knowledge representations.
The integration of the sentences that have one and the
same subject (a noun phrase contained in the user’s request)
is considered a subject knowledge representation, and
any collection of the subject knowledge representations
produced in the knowledge formation process is considered
a user-oriented (highly tailored) description of the subject
field.
Keywords—formal language, knowledge formation,
semantics, subject field, subject-knowledge representation,
syntax
I. INTRODUCTION
Knowledge formation is presented here as the process of
selection and aggregation of input sentences. In this
process, the text sentences are first transformed into
the formal language and then integrated into the
knowledge representation [1, 2]. The integration of the
sentences that have one and the same subject will be
considered a subject knowledge representation, and any
collection of the subject knowledge representations
produced in the knowledge formation process will be
considered a user-oriented (“highly tailored”)
description of the subject field.
In every text sentence, the subject (usually
characterized as “the something or someone that the
sentence is about”, “the thing being talked about”) is
expressed by a grammatically separated noun phrase. This
noun phrase represents either the absolutely independent
part of the sentence (the formal subject of the
subject-predicate division) or the general determinative
part [3], i.e. the attribute that relates to the whole
sentence (the actual subject of the theme-rheme division,
also known as topic-comment, representing the “reflection
of the speaker’s attitude towards what is said”). Both the
formal subject and the actual subject are always expressed
by special grammatical means. In many languages, the
formal subject occupies the main (first) position in the
sentence; in addition, there are special syntactically
neutral word forms that express the formal subject, namely
the nominative case form of the noun (“casus indefinites”).
The actual subject (“theme” or “topic”) can either coincide
with the formal subject or be marked by extra actualization
means (such as special particles or inverted word order).
In the knowledge formation process, each required
noun phrase takes on the formative subject role. Some of
the noun phrases (contained in the user’s request) can be
specially actualized (and thereby elevated to the subject
rank) according to the user’s request information.
The knowledge formation method presented here is
based on the use of a special formal language. In this
formal language, input text sentences are expressed in
set-theoretical (parenthesis-free, “discrete”) form as sets
of their syntactic elements (syntagmes), which allows us
to reduce the semantic identification of sentences to the
use of the standard set-theoretical relation of inclusion.
The set-theoretical form and the integration of input
sentences into the subject-knowledge representation are
considered below.
II. FORMAL LANGUAGE
We proceed from the assumption that the formal
language corresponds to a dibasic algebra with a set of
words and a set of sentences. The set of words represents
a free semi-group over the basic alphabet. The set of
sentences represents a ring-like algebra with one unary
operation and a pair of binary operations - coordination
and determination: the coordination (“addition”) is
commutative and associative; the determination
(“multiplication”) is non-commutative, non-associative,
and one-sided (left-hand) distributive over coordination
[3].
A set of special auxiliary symbols is now needed to
represent the given algebraic operations. The formal
language (L) is then defined in general as a set of
sentences derived over a set of words by means of
algebraic operations, i.e.
L ⊆ L(A*∪Ω),

where A* is a set of words over a basic alphabet A; Ω is
an auxiliary alphabet, i.e. a set of symbols of algebraic
operations (A∩Ω=∅).
So the formal language implies the use of both words
and operation symbols. The words and the operation
symbols serve as a syntactic basis for expressing two
distinct types of semantic elements, namely lexical
and grammatical (functional) meanings. The words
express lexical meanings, and the operation symbols
express grammatical meanings: the symbol of the unary
operation represents general grammatical meanings of
sentences (modality, negation, question, exclamation,
etc.), and the symbols of the binary operations represent
functional meanings of sentence parts. Thus, two
description levels are clearly defined in the formal
language: the syntactic level represents purely abstract
(algebraic) sentence structures (in compliance with the
distinctive features of the given algebraic operations),
and the semantic level represents the sense
interpretation of the algebraic structures.
The syntax that corresponds to the above-mentioned
algebraic operations can be represented by the following
set of rules:
S → ◊S | (X∇X) | (XΔS) | X,
X → X∇X | (XΔS),
where S (“sentence”) is the initial symbol; X (“word”) is
the start symbol for word derivation; ◊ (“modality”), ∇
(“coordination”), Δ (“determination”) are the operation
symbols.
In the syntactic rules, the commutative and
associative coordination is expressed as the atomic
formula X∇X, in which the same symbol is used for both
members in parenthesis-free notation. The coordination
in the rule with parentheses, S→(X∇X), is necessary to
express the left-distributive property of determination
over a coordinated series in strings like XΔ(X∇X…).
The non-commutative and non-associative determination
is expressed by the obligatory use of different symbols
and parentheses in the rules S→(XΔS), X→(XΔS). The
parentheses are required to indicate the “evaluation”
order of the non-associative operation. For this reason
the atomic formula (XΔS) appears twice: in the case
S→(XΔS), the operation Δ is “evaluated” from right to
left as in (XΔ(XΔS…)); in the case X→(XΔS), it is
“evaluated” from left to right as in ((…XΔS)ΔS).
The syntagmatic structures of sentences are now
expressed in explicit form: every independent (head)
member of the generated syntagmatic structure occupies
the first position, and the dependent (defining, non-head)
member is connected by means of the determination symbol
in the second position. So the simplest syntagmatic
structure, the syntagme, has the form of a two-word string
like (X1ΔX2), where X1 (the first word) is the independent
(head) member; ΔX2 (the second word) is the
determinative (dependent) member; X1, X2 are words
representing lexical meanings; Δ (determination) is a
symbol indicating a syntactic (functional) meaning of the
sentence part. Note that the words X1,X2∈A* can take an
empty string value, and then the initial syntagme (X1ΔX2)
can be shortened either to a head member X1, representing
a reduced syntagme, or to a determinative member ΔX2,
representing an elliptical syntagme.
Thus, syntagmatic structures can be expressed in the
standard algebraic form using the parenthesis notation
with a fixed order. As will be shown below, it is also
possible to obtain an alternative, set-theoretical
(parenthesis-free) form of syntagmatic structures. The
knowledge representation proposed here uses the
set-theoretical form, in which every syntagmatic structure
is expressed as a set that contains all the given syntagmes
as ordinary set elements.
III. SET-THEORETICAL FORM OF KNOWLEDGE
REPRESENTATION
The set-theoretical form of knowledge representation
is sufficient for the explicit expression of all the
differences between syntagmatic structures, such as
stepwise and collateral subordination, homogeneous parts,
the absolutely independent part, and the common dependent
part (determinative); see [3]. The following identical
syntagmatic structures are given in both parenthesis and
set-theoretical notation. In both notations, all
differences in syntactic relations are explicitly expressed.
Subordination types. Stepwise subordination is
expressed as a syntagmatic structure in which the dependent
member of the previous syntagme serves as the head
member of the subsequent syntagme (as in the book of the
new author):
(X1Δ1(X2Δ2X3))={X1Δ1X2, X2Δ2X3},
where X2 is both a dependent member of syntagme X1Δ1X2
and a head member of syntagme X2Δ2X3. In the collateral
(parallel) subordination, several dependent members are
connected with their common head member (as in the new
book of the author):
((X1Δ1X2)Δ2X3) = {X1Δ1X2, X1Δ2X3},
where X1 is a head member; Δ1X2, Δ2X3 are dependent
members.
Homogeneous parts can be represented as several
dependent members that are subordinated to a single head
member. In that case the determination sign represents the
common functional meaning of the parts of the sentence,
whereas the coordination sign serves to join the identical
parts together by taking the common functional meaning
outside the brackets:

(X1Δ(X2∇X3))={X1ΔX2, X1ΔX3},
where ∇ is the symbol of coordination; ΔX2, ΔX3 are
homogeneous parts.
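The three correspondences above (stepwise subordination, collateral subordination, and homogeneous parts) can be sketched in Python as a conversion from the parenthesis notation to a set of syntagmes. This is only a minimal illustration under stated assumptions, not the authors' implementation: the class names Det and Coord, the helper functions heads and syntagmes, and the string encoding head_sign.dependent are introduced here for the example, as is the hypothetical sign "of" used for the book/author phrases.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Det:
    """Determination (head Δ dependent); `sign` names the functional meaning."""
    head: "Node"
    sign: str
    dep: "Node"


@dataclass(frozen=True)
class Coord:
    """Coordination (X∇X…) of several members."""
    members: Tuple["Node", ...]


Node = Union[str, Det, Coord]


def heads(node):
    """Return the head word(s) that represent a structure."""
    if isinstance(node, str):
        return [node]
    if isinstance(node, Det):
        return heads(node.head)
    return [h for member in node.members for h in heads(member)]


def syntagmes(node):
    """Flatten a parenthesized structure into a set of two-word syntagmes."""
    if isinstance(node, str):
        return set()
    if isinstance(node, Coord):
        return set().union(*(syntagmes(m) for m in node.members))
    result = syntagmes(node.head) | syntagmes(node.dep)
    for h in heads(node.head):
        for d in heads(node.dep):
            result.add(f"{h}_{node.sign}.{d}")
    return result


# (X1Δ1(X2Δ2X3)): "the book of the new author"
stepwise = syntagmes(Det("book", "of", Det("author", "a", "new")))
# ((X1Δ1X2)Δ2X3): "the new book of the author"
collateral = syntagmes(Det(Det("book", "a", "new"), "of", "author"))
# (X1Δ(X2∇X3)): one head with two homogeneous dependents
homogeneous = syntagmes(Det("man", "p", Coord(("read", "write"))))
```

In this sketch the difference between stepwise and collateral subordination shows up only in which word appears as the head of the second syntagme, which mirrors the two set-theoretical forms given above.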
The absolutely independent part is a single
syntagmatically unmarked word that has no governing
member and therefore does not carry the determination
sign denoting the meaning of a part of the sentence. To
simplify sentence identification, the absolutely
independent part will be emphasized in the set-theoretical
form as a separate set element, e.g.:
((X1Δ1X2)Δ2X3) = {X1, X1Δ1X2, X1Δ2X3},
where an absolutely independent part X1 repeats itself in
the form of a reduced syntagme.
Determinative. It is possible to express a common
dependent member, called the determinative, i.e. the
attribute that relates to the whole sentence. In the
set-theoretical form, the determinative can be emphasized
as a separate set element. In contrast to the emphasized
absolutely independent part, the determinative contains a
determination symbol, i.e. it is represented as an elliptical
syntagme, e.g.:
((X1Δ1(X2Δ2X3))Δ3X4) =
= {Δ3X4, X1, X1Δ1X2, X2Δ2X3},
where Δ3X4 is a determinative (e.g., In the evening, he
reads a book); cf.
(X1Δ1((X2Δ2X3)Δ3X4)) =
= {X1, X1Δ1X2, X2Δ2X3, X2Δ3X4},
where Δ3X4 is an adverbial modifier (as in He walked in
the garden in the evening).
In the formal descriptions of sentences, the lexical
meanings can be expressed by ordinary word stems, while
the meanings of the sentence parts require conventional
signs, such as a (attribute), p (predicate), pt (predicate in
past indefinite tense), o (direct object), in (adverbial
modifier of place, “inside”), etc. For clarity, it is
convenient to separate the words and the conventional signs
by punctuation marks (the underscore character and
the dot). So a real sentence, e.g., The young man reads a
book in the garden, can be expressed by the following set
of syntagmes:
man_a.young ‘[the/a] young man’,
man_p.read ‘[the/a] man reads’,
read_o.book ‘[to] read [the/a] book’,
read_in.garden ‘[to] read in [the/a] garden’.
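The underscore-and-dot notation above can be read mechanically. The following Python sketch splits a syntagme string into its head word, conventional sign, and dependent word, and classifies it as full, reduced, or elliptical. The names Syntagme, parse, and kind are hypothetical, and writing an elliptical syntagme with an empty head (for example "_in.evening") is an assumption made for this illustration; the paper itself only gives the full and reduced string forms.

```python
from typing import NamedTuple


class Syntagme(NamedTuple):
    head: str  # independent (head) member; empty for an elliptical syntagme
    sign: str  # conventional sign of the sentence part, e.g. "a", "p", "o"
    dep: str   # dependent member; empty for a reduced syntagme


def parse(text: str) -> Syntagme:
    """Split 'head_sign.dep' into its three parts.

    Assumes word stems themselves contain neither '_' nor '.'.
    """
    head, _, rest = text.partition("_")
    sign, _, dep = rest.partition(".")
    return Syntagme(head, sign, dep)


def kind(s: Syntagme) -> str:
    """Classify a syntagme as full, reduced (head only) or elliptical."""
    if s.head and s.sign:
        return "full"
    if s.head:
        return "reduced"
    return "elliptical"


sentence = {"man_a.young", "man_p.read", "read_o.book", "read_in.garden"}
parsed = {parse(item) for item in sentence}
```

Keeping the sentence as a plain set of strings is enough for the inclusion tests used later; the parsed form is only needed when the head, sign, or dependent has to be inspected separately.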
Thus, one can suppose that any sentence can be
represented as a certain set of syntagmes. It is also
assumed that any sentence contains at least one noun
phrase (a noun, or a noun with several of its modifiers).
Then the integration of sentences that have one and the
same subject, i.e. a noun phrase contained in the user’s
request information, can be considered a subject knowledge
description (representation).
So a subject knowledge description can be defined as a
set σ(N) of sentences S1, S2, … that contain the common
subject, represented by a noun phrase N, i.e.
σ(N)={S ⊇ N | S is a sentence};
and then any collection of subject knowledge descriptions
produced in the knowledge formation process is a
(user-oriented) subject field description, i.e.
σ(N1, N2,…) = {σ(N1), σ(N2), …},
where σ(N1, N2,…) is a subject field in which N1, N2,…
are noun phrases that play the role of subjects.
Note that a subject can be represented by a noun
phrase expressed as the absolutely independent part, i.e.
the formal subject of the division “subject-predicate”, or
as the determinative, i.e. the actual subject of the division
“theme-rheme”. In a text sentence, the actual subject can
either coincide with the formal subject or be marked by
extra actualization means. In the knowledge formation
process, each noun phrase contained in the user’s request
can take on the formative subject role, i.e. some of the
noun phrases can be specially actualized (and thereby
elevated to the subject role) by special means (e.g., by
inverted word order) according to the user’s request.
IV. RULES OF SUBJECT-KNOWLEDGE FORMATION
Subject knowledge formation is a growth process in
which two formation rules, namely the rules of selection
and aggregation of sentences, must be applied.
The first rule selects the more intensionally
informative sentences by eliminating every sentence that
is less informative than some other sentence. Intensional
superiority is defined through inclusion between sets: of
two sentences S1 and S2, one must be eliminated if it is a
subset of the other. That is, the rule can be formalized as
follows:
{S1, S2}→S1, if S1 ⊇ S2 – selection rule.
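As a hedged illustration of the selection rule, the Python sketch below keeps only those sentences that are not proper subsets of another sentence in the collection. The function name select is hypothetical, and dropping exact duplicates before the comparison is an assumption of this sketch (two identical sentences include each other, so one copy is kept).

```python
def select(sentences):
    """Apply the selection rule: drop every sentence that is a proper
    subset of another sentence; keep a single copy of duplicates."""
    # dict.fromkeys removes duplicates while preserving the input order.
    unique = list(dict.fromkeys(frozenset(s) for s in sentences))
    return [s for s in unique if not any(s < other for other in unique)]


kept = select([
    {"man", "man_p.read"},                  # subset of the next sentence
    {"man", "man_p.read", "read_o.book"},
    {"library"},
    {"library"},                            # exact duplicate
])
```

The comparison is quadratic in the number of sentences, which is adequate for a sketch; a larger collection would call for an index over syntagmes.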
The second rule integrates the already selected
sentences into a collection. So, if S1, S2, … are
sentences that have the same subject N (a noun phrase
contained in the user’s request), they are united in a
common subject knowledge description:

{S1, S2, …}→ σ(N) – aggregation rule.
The following examples illustrate the selection
and aggregation processes. (For the sake of simplicity,
the examples are restricted to simple sentences.)
Let S1, S2, S3, S4, S5 be sentences expressed in terms of
the formal language, such as
S1 = {man, man_a.young, man_p.read,
read_o.book}
‘The young man reads a book’
S2 = {man, man_a.young, man_p.read,
read_o.book, read_in.library}
‘The young man reads a book in the library’
S3 = {man, man_pt.walk, walk_in.park}
‘The man walked in the park’
S4 = {library, library_pPs.situate,
situate_in.street, street_a.graceful}
‘The library is situated in a graceful street’
S5 = {man, man_a.young, man_pt.kick,
kick_o.ball}
‘The young man kicked the ball’
where a, in, o are signs of the secondary sentence parts; p,
pt, pPs are signs of the different predicates (for the
present, past indefinite, and present simple passive,
respectively).
According to the selection rule, the first sentence must
be eliminated because of the intensional superiority of the
second sentence (S1 ⊆ S2).
The sentences S2, S3, S4, S5 can be integrated in
compliance with the aggregation rule. Let “man”, “young
man”, “library” be the subjects contained in the user’s
request. Then, as a result of integration on the given
subjects, the following three subject knowledge
descriptions can be obtained:
σ ({man}) = { S2, S3, S5}
σ ({man, man_a.young}) = { S2, S5}
σ ({library}) = { S2, S4}
Note that the noun phrase “library” contained in S2
must be actualized (by the inverted word order), i.e.
sentence S2 will be transformed into the actual division
form: In the library, the young man reads a book.
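The worked example can be reproduced with a short Python sketch that applies selection and then aggregation to S1 to S5. The function names select and sigma are hypothetical. Actualization of “library” in S2 is modeled here by adding the word as a separate set element; this is an assumption made only for the illustration, since the paper describes actualization linguistically (inverted word order) rather than as a set operation.

```python
sentences = {
    "S1": frozenset({"man", "man_a.young", "man_p.read", "read_o.book"}),
    "S2": frozenset({"man", "man_a.young", "man_p.read", "read_o.book",
                     "read_in.library"}),
    "S3": frozenset({"man", "man_pt.walk", "walk_in.park"}),
    "S4": frozenset({"library", "library_pPs.situate", "situate_in.street",
                     "street_a.graceful"}),
    "S5": frozenset({"man", "man_a.young", "man_pt.kick", "kick_o.ball"}),
}


def select(named):
    """Selection rule: drop a sentence that is a proper subset of another."""
    return {name: s for name, s in named.items()
            if not any(s < other for other in named.values())}


def sigma(subject, named):
    """Aggregation rule: names of the sentences S with S ⊇ subject."""
    return {name for name, s in named.items() if subject <= s}


selected = select(sentences)  # S1 is eliminated because S1 ⊆ S2

# Assumed model of actualization: "library" becomes a separate element of S2.
selected["S2"] = selected["S2"] | {"library"}

subject_field = {
    "man": sigma({"man"}, selected),
    "young man": sigma({"man", "man_a.young"}, selected),
    "library": sigma({"library"}, selected),
}
```

Without the actualization step, the inclusion test would place only S4 in the description for “library”, because S2 contains the word only inside the syntagme read_in.library.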
V. PROSPECTS OF APPLICATION
Implementing the subject knowledge formation
process (and creating various knowledge-based
systems) makes it possible to obtain effective solutions to
a number of pressing problems. In particular, the
following applications are noteworthy.
Knowledge-based text adaptation. The subject
knowledge description produced in the formation process
can be used as a basis for automatic creation (synthesis) of
adapted (user-oriented) text materials - such as
information-analytical reviews, electronic textbooks,
individual teaching materials [1].
Knowledge-based information search. High-precision
information search can be realized as a two-stage
process (resembling ore processing): in the first stage
(data search, “ore mining”), ordinary information
retrieval is used to draw information (as complete as
possible) from a number of sources (that contain valuable
elements); in the second stage (knowledge search, “ore
dressing”), the obtained results are processed to extract
only the important information (knowledge, valuable
elements).
Knowledge-based machine translation. In the
translation of a source text from one natural language to
another, the subject knowledge base (where lexical
compatibility is fixed) can be used as a supporting
interlingua that acts as an effective filter for
screening out all the misplaced meanings of polysemous
words.
REFERENCES
[1] M.S. Shibut, V.S. Yakovishin, Method for creating customized
training materials based on the processing of electronic
information resources, Proceedings of the conference "Applied
Linguistics in Science and Education", devoted to the memory of
Professor R.G. Piotrowski, St. Petersburg, 25-26 March 2010.
St. Petersburg, "Lem", 2010, pp. 339-345 (in Russian).
[2] M.S. Shibut, V.S. Yakovishin, Recognition of grammatical
information in the process of linguistic knowledge, Topical
Problems of Theoretical and Applied Linguistics: Proceedings of
the International Scientific Conference devoted to the memory of
Professor R.G. Piotrowski, Minsk, 15-16 June 2010, Part 2.
Minsk, 2010, pp. 143-147 (in Russian).
[3] V.S. Yakovishin, Algebraic representation of syntagmatic
structures, Web Journal of Formal, Computational & Cognitive
Linguistics, Issue 11, 2009 [Electronic resource]. Mode of
access: http//fccl.ksu.ru/issue11.
