Arden's Theorem
Pumping lemma for regular languages
Use of lemma
Proof of the pumping lemma
General version of pumping lemma for regular languages
Converse of lemma not true
Myhill-Nerode theorem
Context free grammar: Ambiguity
Recognizing ambiguous grammars
Inherently ambiguous languages
Simplification of CFGs
Normal forms for CFGs
Pumping lemma for CFLs
Usage of the lemma
Ambiguous to Unambiguous CFG
Arden's Theorem
In theoretical computer science, Arden's rule, also known as Arden's lemma, is a mathematical
statement about a certain form of language equations. Let P and Q be two regular expressions over
Σ. If P does not contain Λ, then the equation
R = Q + RP
has a unique (one and only one) solution, R = QP*.
Proof:
The statement of Arden's theorem in its general form consists of the following points:
(i) P and Q are two regular expressions over Σ.
(ii) P does not contain the symbol Λ.
(iii) R = Q + RP has a solution, namely R = QP*.
(iv) This solution is the one and only one solution of the equation.
To verify that R = QP* is a solution, substitute R = QP* into the right-hand side of the equation
R = Q + RP. We get
Q + QP*P = Q(Λ + P*P) = Q(Λ + PP*) = QP* = R,
so both sides of the equation agree. So from here it is proved that R = QP* is a solution of the
equation R = Q + RP.
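As a sanity check, Arden's rule can also be verified computationally on truncated languages. The sketch below is an assumption-laden illustration: P = {a} and Q = {b} are arbitrary choices, and every language is cut off at a maximum word length so the sets stay finite.

```python
MAX_LEN = 8  # compare languages only up to this word length

def concat(A, B):
    """Concatenation of two string sets, truncated at MAX_LEN."""
    return {x + y for x in A for y in B if len(x + y) <= MAX_LEN}

def star(A):
    """Kleene star of A, truncated at MAX_LEN."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = concat(frontier, A) - result
        result |= frontier
    return result

# Illustrative choice: P denotes {a}, Q denotes {b}; P does not contain Λ.
P, Q = {"a"}, {"b"}

R = concat(Q, star(P))        # the claimed solution R = QP*
assert R == Q | concat(R, P)  # R satisfies R = Q + RP (up to MAX_LEN)
print(sorted(R)[:3])          # → ['b', 'ba', 'baa']
```

Because every set is truncated at MAX_LEN, the equality is only checked up to that length; the algebraic substitution above covers the general case.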
Pumping lemma for regular languages
Let L be a regular language. Then there exists an integer p ≥ 1, depending only on L, such that
every string w in L of length at least p (p is called the "pumping length") can be written as w =
xyz (i.e., w can be divided into three substrings), satisfying the following conditions:
1. |y| ≥ 1;
2. |xy| ≤ p;
3. xy^i z ∈ L for all i ≥ 0.
Here y is the substring that can be pumped (removed or repeated any number of times, with the
resulting string always in L). Condition (1) means the loop y to be pumped must be of length at
least one; condition (2) means the loop must occur within the first p characters. By (1) and (2),
|x| must be smaller than p; apart from that, there is no restriction on x and z.
In simple words: for any regular language L, any sufficiently long word w in L can be split into
three parts, w = xyz, such that all the strings xy^k z for k ≥ 0 are also in L.
Use of lemma
The pumping lemma is often used to prove that a particular language is non-regular: a proof by
contradiction (of the language's regularity) may consist of exhibiting a word (of the required
length) in the language which lacks the property outlined in the pumping lemma.
For example, the language L = {a^n b^n : n ≥ 0} over the alphabet Σ = {a, b} can be shown to be
non-regular as follows. Let w, x, y, z, p, and i be as used in the formal statement of the pumping
lemma above. Let w in L be given by w = a^p b^p. By the pumping lemma, there must be some
decomposition w = xyz with |xy| ≤ p and |y| ≥ 1 such that xy^i z is in L for every i ≥ 0. Using
|xy| ≤ p, we know y consists only of instances of a. Moreover, because |y| ≥ 1, it contains at
least one instance of the letter a. We now pump y up: xy^2 z has more instances of the letter a
than the letter b, since we have added some instances of a without adding instances of b.
Therefore xy^2 z is not in L. We have reached a contradiction. Therefore, the assumption that L
is regular must be incorrect. Hence L is not regular.
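The contradiction argument can also be checked mechanically for a fixed candidate pumping length. The sketch below (p = 4 is an arbitrary illustrative choice) enumerates every decomposition w = xyz of w = a^p b^p permitted by the lemma and confirms that pumping with i = 2 always produces a string outside L:

```python
def in_L(s):
    """Membership in L = {a^n b^n : n >= 0}."""
    half = len(s) // 2
    return s == "a" * half + "b" * half

p = 4                    # candidate pumping length (arbitrary illustration)
w = "a" * p + "b" * p    # the witness word from the proof

# Try every decomposition w = xyz with |xy| <= p and |y| >= 1.
for xy_len in range(1, p + 1):
    for x_len in range(xy_len):
        x, y, z = w[:x_len], w[x_len:xy_len], w[xy_len:]
        # Because |xy| <= p, y consists only of a's; pumping breaks the balance.
        assert set(y) == {"a"}
        assert not in_L(x + y * 2 + z)

print("every decomposition of", w, "fails when pumped")
```

This does not prove the general statement (one fixed p is not "for all p"), but it mirrors exactly the case analysis carried out in the proof above.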
The proof that the language of balanced (i.e., properly nested) parentheses is not regular follows
the same idea. Given p, there is a string of balanced parentheses that begins with more than p left
parentheses, so that y will consist entirely of left parentheses. By repeating y, we can produce a
string that does not contain the same number of left and right parentheses, and so they cannot be
balanced.
Proof of the pumping lemma
For every regular language there is a finite state automaton (FSA) that accepts the language. The
number of states in such an FSA is counted, and that count is used as the pumping length p. For
a string of length at least p, let s0 be the start state and let s1, ..., sp be the sequence of the next p
states visited as the first p characters of the string are read. Because the FSA has only p states,
within this sequence of p + 1 visited states there must be at least one state that is repeated. Write
S for such a state. The transitions that take the machine from the first encounter of state S to the
second encounter of state S match some string. This string is called y in the lemma, and since the
machine matches the string with the y portion removed, or with y repeated any number of times,
the conditions of the lemma are satisfied.
For example, consider the following FSA (the original figure is not reproduced here).
The FSA accepts the string: abcd. Since this string has a length which is at least as large as the
number of states, which is four, the pigeonhole principle indicates that there must be at least one
repeated state among the start state and the next four visited states. In this example, only q1 is a
repeated state. Since the substring bc takes the machine through transitions that start at state q1
and end at state q1, that portion could be repeated and the FSA would still accept, giving the
string abcbcd. Alternatively, the bc portion could be removed and the FSA would still accept
giving the string ad. In terms of the pumping lemma, the string abcd is broken into an x portion
a, a y portion bc and a z portion d.
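Since the figure is missing, here is a simulation of a four-state DFA consistent with the description above; the state names and transitions (it accepts a(bc)*d) are assumptions made to match the text:

```python
# Transitions of a 4-state DFA consistent with the description:
# q0 -a-> q1 -b-> q2 -c-> q1 -d-> q3 (accepting), i.e. it accepts a(bc)*d.
delta = {("q0", "a"): "q1", ("q1", "b"): "q2",
         ("q2", "c"): "q1", ("q1", "d"): "q3"}
final = {"q3"}

def run(word, start="q0"):
    """Return the list of states visited, or None on a missing transition."""
    states = [start]
    for ch in word:
        nxt = delta.get((states[-1], ch))
        if nxt is None:
            return None
        states.append(nxt)
    return states

def accepts(word):
    states = run(word)
    return states is not None and states[-1] in final

# abcd visits q0 q1 q2 q1 q3: state q1 repeats, so y = "bc" can be pumped.
assert run("abcd") == ["q0", "q1", "q2", "q1", "q3"]
assert accepts("abcd") and accepts("abcbcd") and accepts("ad")
```

The repeated state q1 delimits exactly the x = a, y = bc, z = d split described in the text: removing y gives ad, repeating it gives abcbcd, and both remain accepted.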
General version of pumping lemma for regular languages
If a language L is regular, then there exists a number p ≥ 1 (the pumping length) such that every
string uwv in L with |w| ≥ p can be written in the form
uwv = uxyzv
with strings x, y and z such that |xy| ≤ p, |y| ≥ 1 and
uxy^i zv is in L for every integer i ≥ 0.
This version can be used to prove many more languages are non-regular, since it imposes stricter
requirements on the language.
Converse of lemma not true
Note that while the pumping lemma states that all regular languages satisfy the conditions
described above, the converse of this statement is not true: a language that satisfies these
conditions may still be non-regular. In other words, both the original and the general version of
the pumping lemma give a necessary but not sufficient condition for a language to be regular.
For example, consider the following language L:
L = {uvwxy : u, y ∈ {0,1,2,3}*; v, w, x ∈ {0,1,2,3}; v = w or v = x or w = x}
 ∪ {w ∈ {0,1,2,3}* : precisely 1/7 of w's characters are 3's}.
In other words, L contains all strings over the alphabet {0,1,2,3} with a substring of length 3
including a duplicate character, as well as all strings over this alphabet where precisely 1/7 of the
string's characters are 3's. This language is not regular but can still be "pumped" with p = 5.
Suppose some string s has length at least 5. Then, since the alphabet has only four characters, at
least two of the five characters in the string must be duplicates. They are separated by at most
three characters.
• If the duplicate characters are separated by 0 characters, or 1, pump one of the other two
characters in the string, which will not affect the substring containing the duplicates.
• If the duplicate characters are separated by 2 or 3 characters, pump 2 of the characters
separating them. Pumping either down or up results in the creation of a substring of size 3
that contains 2 duplicate characters.
• The second condition of L ensures that L is not regular: i.e., there are an infinite number
of strings that are in L but cannot be obtained by pumping some smaller string in L.
For a practical test that exactly characterizes regular languages, see the Myhill-Nerode theorem.
The typical method for proving that a language is regular is to construct either a finite state
machine or a regular expression for the language.
Myhill-Nerode theorem
Consider machines such as the parity machine, an acceptor accepting {a^n b^m | n, m ≥ 1}, and
many more such examples. Consider the serial adder. After reading some input, the machine can
be in the 'carry' state or the 'no carry' state. It
does not matter what exactly the earlier input was. It is only necessary to know whether it has
produced a carry or not. Hence, the FSA need not distinguish between each and every input. It
distinguishes between classes of inputs. In the above case, the whole set of inputs can be
partitioned into two classes – one that produces a carry and another that does not produce a carry.
Similarly, in the case of parity checker, the machine distinguishes between two classes of input
strings: those containing an odd number of 1's and those containing an even number of 1's. Thus,
the FSA distinguishes between classes of input strings, and the number of such classes is finite.
Hence, we say that the FSA has a finite amount of memory.
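The serial adder described above can be sketched as a two-state machine whose only memory is the carry bit. The bit-list encoding (least significant bit first) is an illustrative choice:

```python
def serial_add(a_bits, b_bits):
    """Add two equal-length bit lists, least significant bit first,
    remembering only a carry / no-carry state between steps."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s = a + b + carry
        out.append(s % 2)   # output bit for this position
        carry = s // 2      # next state: carry (1) or no carry (0)
    out.append(carry)       # final carry-out
    return out

# 3 (LSB-first [1,1,0]) + 1 ([1,0,0]) = 4 ([0,0,1,0])
assert serial_add([1, 1, 0], [1, 0, 0]) == [0, 0, 1, 0]
```

The machine never consults the earlier input directly; two different input histories that left the same carry state are treated identically, which is precisely the equivalence-class idea above.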
The following three statements are equivalent.
1. L ⊆ Σ* is accepted by a DFSA.
2. L is the union of some of the equivalence classes of a right invariant equivalence relation
of finite index on Σ*.
3. Let equivalence relation RL be defined over Σ* as follows: xRL y if and only if, for all z
∊ Σ*, xz is in L exactly when yz is in L. Then RL is of finite index.
Proof: We shall prove (1) ⇒ (2), (2) ⇒ (3), and (3) ⇒ (1).
(1) ⇒ (2)
Let L be accepted by an FSA M = (K, Σ, δ, q0, F). Define a relation RM on Σ* such that xRMy if
δ(q0, x) = δ(q0, y). RM is an equivalence relation, as seen below:
Reflexive: ∀x, xRMx, since δ(q0, x) = δ(q0, x).
Symmetric: ∀x, y, xRMy ⇒ yRMx, since δ(q0, x) = δ(q0, y) means δ(q0, y) = δ(q0, x).
Transitive: ∀x, y, z, xRMy and yRMz ⇒ xRMz, for if δ(q0, x) = δ(q0, y) and δ(q0, y) = δ(q0, z),
then δ(q0, x) = δ(q0, z).
So RM divides Σ* into equivalence classes. The set of strings which take the machine from q0 to
a particular state qi form one equivalence class. The number of equivalence classes is therefore
equal to the number of states of M, assuming every state is reachable from q0. (If a state is not
reachable from q0, it can be removed without affecting the language accepted.) It can easily be
seen that this equivalence relation RM is right invariant, i.e., if xRMy, then xzRMyz for all z ∊ Σ*.
For if xRMy, then δ(q0, x) = δ(q0, y), and so
δ(q0, xz) = δ(δ(q0, x), z) = δ(δ(q0, y), z) = δ(q0, yz). Therefore xzRMyz.
L is the union of those equivalence classes of RM which correspond to final states of M.
(2) ⇒ (3)
Assume statement (2) of the theorem and let E be the equivalence relation considered. Let RL be
defined as in the statement of the theorem. We see that xEy ⇒ xRL y.
If xEy, then xzEyz for each z ∊ Σ*. xz and yz are in the same equivalence class of E. Hence, xz
and yz are both in L or both not in L as L is the union of some of the equivalence classes of E.
Hence xRL y.
Hence, any equivalence class of E is completely contained in an equivalence class of RL.
Therefore, E is a refinement of RL and so the index of RL is less than or equal to the index of E
and hence finite.
(3) ⇒ (1)
First, we show RL is right invariant. Suppose xRLy, i.e., for all z in Σ*, xz is in L exactly when
yz is in L. Then for any w in Σ*, taking the extension wz, we get that xwz is in L exactly when
ywz is in L, for all z in Σ*. This means xwRLyw.
Therefore, RL is right invariant.
Let [x] denote the equivalence class of RL to which x belongs.
Construct a DFSA ML = (K′, Σ, δ′, q′0, F′) as follows: K′ contains one state corresponding to
each equivalence class of RL; q′0 is the state [ε]; F′ consists of those states [x] with x ∊ L; and δ′
is defined by δ′([x], a) = [xa]. This definition is consistent because RL is right invariant: if x and
y belong to the same equivalence class of RL, then xa and ya also belong to the same equivalence
class of RL, so
δ′([x], a) = [xa] = [ya] = δ′([y], a).
If x ∊ L, then [x] is a final state of ML, i.e., [x] ∊ F′. This automaton ML accepts L.
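The theorem also gives a practical non-regularity test: if infinitely many strings are pairwise distinguishable under RL, then RL has infinite index and no DFSA can accept L. A sketch for L = {a^n b^n : n ≥ 0} (the choice of language and the bound N are illustrative):

```python
def in_L(s):
    """Membership in L = {a^n b^n : n >= 0}."""
    half = len(s) // 2
    return s == "a" * half + "b" * half

def distinguishable(x, y, z):
    """z is a witness that x and y lie in different classes of RL."""
    return in_L(x + z) != in_L(y + z)

# The prefixes a, aa, aaa, ... are pairwise distinguishable: the extension
# b^i separates a^i from a^j whenever i != j.  Hence RL has infinite index,
# and by the theorem L cannot be accepted by any DFSA.
N = 6  # check the first few classes (illustrative bound)
for i in range(1, N):
    for j in range(i + 1, N):
        assert distinguishable("a" * i, "a" * j, "b" * i)

print("a^1 .. a^%d are pairwise distinguishable" % (N - 1))
```

The code only checks finitely many prefixes, but the same witness b^i works for every pair a^i, a^j, which is the full infinite-index argument.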
Context free grammar: Ambiguity
An ambiguous grammar is a formal grammar for which there exists a string that can have more
than one leftmost derivation, while an unambiguous grammar is a formal grammar for which
every valid string has a unique leftmost derivation. Many languages admit both ambiguous and
unambiguous grammars, while some languages admit only ambiguous grammars. Any non-
empty language admits an ambiguous grammar by taking an unambiguous grammar and
introducing a duplicate rule or synonym (the only language without ambiguous grammars is the
empty language). A language that only admits ambiguous grammars is called an inherently
ambiguous language, and there are inherently ambiguous context-free languages. Deterministic
context-free grammars are always unambiguous, and are an important subclass of unambiguous
CFGs; there are non-deterministic unambiguous CFGs, however. For real-world programming
languages, the reference CFG is often ambiguous, due to issues such as the dangling else
problem. If present, these ambiguities are generally resolved by adding precedence rules or other
context-sensitive parsing rules, so the overall phrase grammar is unambiguous.
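Ambiguity can be observed concretely by counting parse trees. The sketch below uses the classic ambiguous grammar E → E + E | a (a standard textbook example, not a grammar taken from this text) and counts the parse trees of a string:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_parses(s):
    """Number of parse trees of s under the grammar E -> E + E | a."""
    if s == "a":
        return 1                # the rule E -> a
    total = 0
    for i, ch in enumerate(s):
        if ch == "+":           # try the rule E -> E + E, splitting here
            total += count_parses(s[:i]) * count_parses(s[i + 1:])
    return total

assert count_parses("a") == 1
assert count_parses("a+a") == 1
assert count_parses("a+a+a") == 2   # two parse trees: the grammar is ambiguous
```

A string with more than one parse tree has more than one leftmost derivation, so a count above 1 certifies ambiguity; note that by the undecidability result above, no such check can work for all grammars and all strings.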
Recognizing ambiguous grammars
The general decision problem of whether a grammar is ambiguous is undecidable because it can
be shown that it is equivalent to the Post correspondence problem. There are, however, tools
implementing some semi-decision procedure for detecting ambiguity of context-free grammars.
The efficiency of context-free grammar parsing is determined by the automaton that accepts it.
Deterministic context-free grammars are accepted by deterministic pushdown automata and can
be parsed in linear time, for example by the LR parser.[2]
This is a subset of the context-free
grammars which are accepted by the pushdown automaton and can be parsed in polynomial time,
for example by the CYK algorithm. Unambiguous context-free grammars can be
nondeterministic. For example, the language of even-length palindromes on the alphabet of 0 and
1 has the unambiguous context-free grammar S → 0S0 | 1S1 | ε. An arbitrary string of this
language cannot be parsed without reading all its letters first which means that a pushdown
automaton has to try alternative state transitions to accommodate the different possible lengths
of a semi-parsed string.[3]
Nevertheless, removing grammar ambiguity may produce a
deterministic context-free grammar and thus allow for more efficient parsing. Compiler
generators such as YACC include features for resolving some kinds of ambiguity, such as by
using the precedence and associativity constraints.
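The even-palindrome grammar S → 0S0 | 1S1 | ε mentioned above can be checked by direct recursion on its rules. Note that this checker sees the whole string at once; a pushdown automaton reading left to right cannot know where the middle is, which is why it needs nondeterminism even though the grammar is unambiguous:

```python
def derives(s):
    """Does S -> 0S0 | 1S1 | eps derive s?  Direct recursion on the rules."""
    if s == "":
        return True                       # S -> eps
    return (len(s) >= 2 and s[0] == s[-1] and s[0] in "01"
            and derives(s[1:-1]))         # S -> 0S0 or S -> 1S1

# Each even-length palindrome matches exactly one rule at every step,
# so every derivable string has a unique derivation.
assert derives("") and derives("0110") and derives("101101")
assert not derives("010")    # odd length: not generated by this grammar
assert not derives("0100")   # not a palindrome
```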
Inherently ambiguous languages
Inherent ambiguity was proven with Parikh's theorem in 1961 by Rohit Parikh in an MIT
research report.
While some context-free languages (the set of strings that can be generated by a grammar) have
both ambiguous and unambiguous grammars, there exist context-free languages for which no
unambiguous context-free grammar can exist. An example of an inherently ambiguous language
is the union of {a^n b^n c^m d^m : n, m > 0} with {a^n b^m c^m d^n : n, m > 0}. This set is
context-free, since the union of two context-free languages is always context-free. But Hopcroft
& Ullman (1979) give a proof that there is no way to unambiguously parse strings in the (non-
context-free) subset {a^n b^n c^n d^n : n > 0}, which is the intersection of these two languages.
Simplification of CFGs
A context-free grammar (CFG) is a formal grammar in which every production rule is of the form
V → w
where V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals (w
can be empty). A formal grammar is considered "context-free" when its production rules can be
applied regardless of the context of a nonterminal. It does not matter which symbols the
nonterminal is surrounded by, the single nonterminal on the left hand side can always be
replaced by the right hand side. Languages generated by context-free grammars are known as
context-free languages (CFL). Different context-free grammars can generate the same context-
free language. It is important to distinguish properties of the language (intrinsic properties) from
properties of a particular grammar (extrinsic properties). Given two context free grammars, the
language equality question (do they generate the same language?) is undecidable. Context-free
grammars are important in linguistics for describing the structure of sentences and words in
natural language, and in computer science for describing the structure of programming languages
and other formal languages. In linguistics, some authors use the term phrase structure grammar
to refer to context-free grammars, whereby phrase structure grammars are distinct from
dependency grammars. In computer science, a popular notation for context-free grammars is
Backus–Naur Form, or BNF.
Normal forms for CFGs
If L(G) does not contain Λ, then G has an equivalent grammar in Chomsky normal form (CNF),
with productions only of the types
A → BC or A → a,
where A, B and C are nonterminal symbols and a is a terminal symbol.
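One reason CNF matters is that the CYK algorithm decides membership for any CNF grammar. The grammar below, for {a^n b^n : n ≥ 1}, is an assumed example for illustration, not taken from the text:

```python
def cyk(word, rules, start="S"):
    """CYK membership test for a grammar in Chomsky normal form.
    rules maps each nonterminal to a list of bodies: a terminal
    string 'a' or a pair of nonterminals ('B', 'C')."""
    n = len(word)
    if n == 0:
        return False
    # table[i][l-1] = set of nonterminals deriving word[i : i + l]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):                       # rules A -> a
        for head, bodies in rules.items():
            if ch in bodies:
                table[i][0].add(head)
    for length in range(2, n + 1):                      # rules A -> BC
        for i in range(n - length + 1):
            for split in range(1, length):
                for head, bodies in rules.items():
                    for body in bodies:
                        if (isinstance(body, tuple)
                                and body[0] in table[i][split - 1]
                                and body[1] in table[i + split][length - split - 1]):
                            table[i][length - 1].add(head)
    return start in table[0][n - 1]

# Illustrative CNF grammar for {a^n b^n : n >= 1}:
# S -> AX | AB, X -> SB, A -> a, B -> b
rules = {"S": [("A", "X"), ("A", "B")], "X": [("S", "B")],
         "A": ["a"], "B": ["b"]}
assert cyk("ab", rules) and cyk("aabb", rules)
assert not cyk("aab", rules) and not cyk("abab", rules)
```

CYK runs in O(n³ · |G|) time on any CNF grammar, which is the polynomial-time bound mentioned earlier for general context-free parsing.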
Pumping lemma for CFLs
The pumping lemma for context-free languages, also known as the Bar-Hillel lemma, is a lemma
that gives a property shared by all context-free languages. If a language L is context-free, then
there exists some integer p ≥ 1 such that every string s in L with |s| ≥ p (where p is a "pumping
length") can be written as
s = uvxyz with substrings u, v, x, y and z, such that
1. |vxy| ≤ p,
2. |vy| ≥ 1, and
3. uv^n x y^n z is in L for all n ≥ 0.
Usage of the lemma
The pumping lemma for context-free languages can be used to show that certain languages are
not context-free. For example, we can show that the language L = {a^n b^n c^n : n ≥ 1} is not
context-free by using the pumping lemma in a proof by contradiction. First, assume that L is
context-free. By the pumping lemma, there exists an integer p which is the pumping length of
language L. Consider the string s = a^p b^p c^p in L. The pumping lemma tells us that s can be
written in the form s = uvxyz, where u, v, x, y and z are substrings, such that |vxy| ≤ p, |vy| ≥ 1,
and uv^n x y^n z is in L for every integer n ≥ 0. By our choice of s and the fact that |vxy| ≤ p, it
is easily seen that the substring vxy can contain no more than two distinct letters. That is, we
have one of five possibilities for vxy:
1. vxy = a^j for some j ≤ p.
2. vxy = a^j b^k for some j and k with j + k ≤ p.
3. vxy = b^j for some j ≤ p.
4. vxy = b^j c^k for some j and k with j + k ≤ p.
5. vxy = c^j for some j ≤ p.
For each case, it is easily verified that uv^n x y^n z does not contain equal numbers of each letter
for any n ≠ 1. Thus, uv^2 x y^2 z does not have the form a^i b^i c^i. This contradicts the
definition of L. Therefore, our initial assumption that L is context-free must be false.
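The five-case analysis can be checked exhaustively for a small candidate pumping length. The sketch below (p = 3 is an arbitrary illustrative choice) enumerates every decomposition s = uvxyz allowed by the lemma and confirms that pumping with n = 2 always leaves L:

```python
from itertools import combinations_with_replacement

def in_L(s):
    """Membership in L = {a^n b^n c^n : n >= 1}."""
    n = len(s) // 3
    return n >= 1 and s == "a" * n + "b" * n + "c" * n

p = 3                           # candidate pumping length (illustrative)
s = "a" * p + "b" * p + "c" * p

# Cut points 0 <= a <= b <= c <= d <= |s| give s = uvxyz with
# u = s[:a], v = s[a:b], x = s[b:c], y = s[c:d], z = s[d:].
for a, b, c, d in combinations_with_replacement(range(len(s) + 1), 4):
    if d - a <= p and (b - a) + (d - c) >= 1:  # |vxy| <= p and |vy| >= 1
        u, v, x, y, z = s[:a], s[a:b], s[b:c], s[c:d], s[d:]
        # Pumping with n = 2 always produces a string outside L.
        assert not in_L(u + v * 2 + x + y * 2 + z)

print("no decomposition of", s, "survives pumping with n = 2")
```

As with the regular-language case, one fixed p does not replace the proof, but the enumeration covers exactly the five cases listed above: vxy fits in a window of at most p consecutive letters, so it misses at least one of a, b, c.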
While the pumping lemma is often a useful tool to prove that a given language is not context-
free, it does not give a complete characterization of the context-free languages. If a language
does not satisfy the condition given by the pumping lemma, we have established that it is not
context-free. On the other hand, there are languages that are not context-free, but still satisfy the
condition given by the pumping lemma. There are more powerful proof techniques available,
such as Ogden's lemma, but even these techniques do not give a complete characterization of the
context-free languages.
Ambiguous to Unambiguous CFG
While some context-free languages (the set of strings that can be generated by a grammar) have
both ambiguous and unambiguous grammars, there exist context-free languages for which no
unambiguous context-free grammar can exist. An example of an inherently ambiguous language
is the union of {a^n b^n c^m d^m : n, m > 0} with {a^n b^m c^m d^n : n, m > 0}. This set is
context-free, since the union of two context-free languages is always context-free. But Hopcroft
& Ullman (1979) give a proof that there is no way to unambiguously parse strings in the (non-
context-free) subset {a^n b^n c^n d^n : n > 0}, which is the intersection of these two languages.