Separating Shuffle Regular Expressions for Data Words
Manoj Kilaru Etienne Lozes Sylvain Schmitz
May 27, 2016
1 Definitions
1.1 Background
A data word is a finite sequence $w = \binom{a_1}{d_1} \cdots \binom{a_n}{d_n} \in (\Sigma \times D)^*$, where $\Sigma$ is a finite set of symbols, and $D$ is an infinite set of data (in all examples, $D = \mathbb{N}$). The string projection [standard name? -el] $\mathrm{str}(w)$ is the word $a_1 \cdots a_n \in \Sigma^*$.
For a fixed data word $w = \binom{a_1}{d_1} \cdots \binom{a_n}{d_n}$, the length $|w|$ of $w$ is $n$, and the underlying equivalence relation $\sim_w$ is $\{(i, j) \mid i, j \le |w| \text{ and } d_i = d_j\}$. For two data words $w, w'$, we write $w \approx w'$ if $\mathrm{str}(w) = \mathrm{str}(w')$ and ${\sim_w} = {\sim_{w'}}$.
A data language is a set of data words $L \subseteq (\Sigma \times D)^*$ closed under $\approx$.
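The definitions above can be made concrete with a minimal Python sketch. It assumes that a data word is represented as a list of (symbol, datum) pairs; the names `string_projection`, `class_partition` and `equivalent` are illustrative and do not come from the paper.

```python
from typing import Hashable, List, Sequence, Tuple

# Assumed representation: a data word is a sequence of (symbol, datum) pairs.
DataWord = Sequence[Tuple[str, Hashable]]


def string_projection(w: DataWord) -> List[str]:
    """Return str(w), the sequence of symbols of w."""
    return [a for a, _ in w]


def class_partition(w: DataWord) -> List[Tuple[int, ...]]:
    """Return the classes of ~_w as sorted tuples of 1-indexed positions."""
    classes = {}
    for i, (_, d) in enumerate(w, start=1):
        classes.setdefault(d, []).append(i)
    return sorted(tuple(c) for c in classes.values())


def equivalent(w1: DataWord, w2: DataWord) -> bool:
    """Check w1 ≈ w2: same string projection and same equivalence relation."""
    return (string_projection(w1) == string_projection(w2)
            and class_partition(w1) == class_partition(w2))
```

Comparing partitions rather than data values is what makes the check insensitive to renaming of data, which is exactly the closure property required of a data language.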
[TODO: Define RA and DA -el]
Several formalisms for describing data languages have been proposed; they can be grouped into three categories:
• register automata (RA, see [KF94])
• first/second order logic (most notably $\mathrm{FO}^2(+1, <, \sim)$, see [BDM+11]) [define these logics -el]
• data automata (DA, see [BDM+11])
For any RA, an $\mathrm{FO}^2(+1, <, \sim)$ formula recognizing the same language can be computed [complexity? -el], and for any $\mathrm{FO}^2(+1, <, \sim)$ formula, an equivalent DA can be computed in 2-EXPTIME. However, no converse translations are possible: the language defined by the formula $\forall x, y.\; x \neq y \Rightarrow x \not\sim y$ cannot be recognized by a register automaton (example language for $\mathrm{FO}^2$ RA?).
[We also found a direct encoding of RA into DA that is a bit simpler than the one in [BS07]. The idea is to guess the run of the RA, and to guess, for every position of the run manipulating a register r, whether this is the last access to this register before its content is updated. -el]
Given a language L described by a RA/DA/FO formula, one might want to know whether it contains at least one word; this is called the emptiness problem for RA/DA and the satisfiability problem for FO formulas. Emptiness is decidable for RA and DA, and satisfiability is decidable for $\mathrm{FO}^2(+1, <, \sim)$. However, closely related formalisms for data words, such as $\mathrm{FO}^3(+1, \sim)$, have undecidable emptiness/satisfiability.
1.2 Separating shuffle
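A data word $w$ is a separating shuffle of $w_1$ and $w_2$, noted $w \in w_1 \shuffle w_2$, if $w$ is a shuffle of $w_1$ and $w_2$, and if the data occurring in $w_1$ are disjoint from the data of $w_2$. For instance, $\binom{a}{1}\binom{b}{2}\binom{c}{1}\binom{d}{3} \in \binom{a}{1}\binom{c}{1} \shuffle \binom{b}{2}\binom{d}{3}$, but $\binom{a}{1}\binom{b}{2}\binom{c}{1}\binom{d}{3} \notin \binom{a}{1}\binom{d}{3} \shuffle \binom{b}{2}\binom{c}{1}$, because 1 is a common datum of the two words. $\shuffle$ is extended to languages as expected: $L_1 \shuffle L_2$ is the set of words $w$ such that $w \in w_1 \shuffle w_2$ for at least one $w_1 \in L_1$ and one $w_2 \in L_2$.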
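As an illustration, here is a small Python sketch of the membership test for the separating shuffle of two words. It assumes data words are lists of (symbol, datum) pairs, and the helper names `restrict` and `is_separating_shuffle` are hypothetical. Because the two words must have disjoint data, the interleaving is determined by the data alone: the positions of $w$ coming from $w_1$ are exactly those carrying a datum of $w_1$.

```python
def restrict(w, data):
    """Keep only the positions of w whose datum belongs to `data`."""
    return [(a, d) for a, d in w if d in data]


def is_separating_shuffle(w, w1, w2):
    """Decide whether w is a separating shuffle of w1 and w2."""
    data1 = {d for _, d in w1}
    data2 = {d for _, d in w2}
    # The two words must not share any datum.
    if data1 & data2:
        return False
    # Every position of w must come from w1 or from w2.
    if any(d not in data1 and d not in data2 for _, d in w):
        return False
    # With disjoint data, the split of w is forced by the data values.
    return restrict(w, data1) == list(w1) and restrict(w, data2) == list(w2)


# The two examples from the text.
w = [("a", 1), ("b", 2), ("c", 1), ("d", 3)]
first = is_separating_shuffle(w, [("a", 1), ("c", 1)], [("b", 2), ("d", 3)])
second = is_separating_shuffle(w, [("a", 1), ("d", 3)], [("b", 2), ("c", 1)])
```

Here `first` holds and `second` does not, since the datum 1 occurs in both candidate components.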
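The separating shuffle can be combined with the boolean connectives and regular expressions (RE) to define separating shuffle regular expressions (SSRE):
$$e ::= \varepsilon \mid a \in \Sigma \mid e.e \mid e \vee e \mid e^* \qquad \text{(RE)}$$
$$\varphi ::= e \mid \varphi \wedge \varphi \mid \varphi \vee \varphi \mid \neg\varphi \mid \varphi \shuffle \varphi \qquad \text{(SSRE)}$$
[define SSRE with homomorphisms -el]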
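The semantics of a SSRE is defined as expected. For instance, the SSRE
$$\neg(\neg\varepsilon \shuffle \neg\varepsilon) \wedge \neg\varepsilon$$
describes the set of data words in which the same datum occurs at all positions. If we abbreviate this SSRE as atom, the following
$$\neg\Big(\Sigma^* \shuffle \big(\mathit{atom} \wedge \neg\bigvee_{a \in \Sigma} a\big)\Big)$$
describes the set of words that satisfy the formula $\forall x, y.\; x \neq y \Rightarrow x \not\sim y$.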
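The two example expressions can be checked by brute force on small words. The Python sketch below is only an illustration under stated assumptions: data words are lists of (symbol, datum) pairs, the function names are hypothetical, and the second expression is read as "no shuffle factor is an atom of length at least two". It relies on the fact that every decomposition of $w$ as a separating shuffle corresponds to a choice of a subset of the data values of $w$.

```python
from itertools import combinations


def shuffle_splits(w):
    """Yield every pair (w1, w2) such that w is a separating shuffle of w1 and w2."""
    values = sorted({d for _, d in w}, key=repr)
    for k in range(len(values) + 1):
        for chosen in combinations(values, k):
            s = set(chosen)
            w1 = [(a, d) for a, d in w if d in s]
            w2 = [(a, d) for a, d in w if d not in s]
            yield w1, w2


def is_atom(w):
    """Semantics of atom: w is non-empty and cannot be split into two
    non-empty words with disjoint data."""
    if not w:
        return False
    return not any(w1 and w2 for w1, w2 in shuffle_splits(w))


def all_distinct_via_atom(w):
    """Assumed reading: no shuffle factor of w is an atom of length >= 2."""
    return not any(is_atom(w2) and len(w2) >= 2 for _, w2 in shuffle_splits(w))
```

On any word, `is_atom` agrees with the direct test "exactly one datum occurs", and `all_distinct_via_atom` agrees with the direct test "no datum occurs twice"; the enumeration is exponential in the number of data values, so it is only meant for experimenting with small examples.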
2 Questions
We are interested in the following questions:
• how do SSRE relate to RA/DA/$\mathrm{FO}^2$?
• how do the projections of SSRE languages relate to multicounter automata, or to shuffle-star definable languages?
• is emptiness decidable for SSRE?
• what about variants of this problem (for instance, if we allow concatenation above shuffle)?
• can we axiomatize SSRE (like Kleene algebras [Koz02], concurrent Kleene algebras [HMSW11], nominal Kleene algebras [GC11, KMS15])?
• what is the role of homomorphisms in SSRE (expressiveness, decidability)?
3 Results
Theorem 1. Let L be such that L/ ≈ is finite. Then L is SSRE definable.
Proof. It is enough to show that $L_0 := \{w \mid w \approx w_0\}$ is SSRE definable for any $w_0$. We reason by induction on $|w_0|$:
• if $w_0 = \varepsilon$, then $L_0$ is defined by the SSRE $\varepsilon$
• if $w_0 = \binom{a}{d}$, then $L_0$ is defined by the SSRE $a$
• otherwise $w_0 = \binom{a}{d} \cdot w_R = w_L \cdot \binom{a'}{d'}$ and by induction there are $e_R$ and $e_L$ that define $[w_R]_\approx$ and $[w_L]_\approx$ respectively. Then $a.e_R \wedge e_L.a' \wedge \underbrace{\mathrm{1class} \shuffle \cdots \shuffle \mathrm{1class}}_{k \text{ times}}$, where $k$ is the index of $\sim_{w_0}$, defines $[w_0]_\approx$. Indeed, if either $d$ or $d'$ occurs in the middle of the word, then $\sim_{w_R}$ and $\sim_{w_L}$ completely determine $\sim_{w_0}$. Otherwise, whether $d = d'$ is determined by the number of equivalence classes of $\sim_{w_0}$.
Theorem 2. Every DA definable language is definable by a SSRE+hom.
Proof. [TODO. Check whether this can be done without any negations, using just the duals of all connectives. -el]
Proof using negations
Consider a data automaton $A = \langle \tau, B \rangle$, where $\tau$ is a letter-to-letter transducer and $B$ is a class automaton, which is a nondeterministic finite automaton. Since $B$ is an NFA, its language can be expressed by a regular expression; call it $B_e$.
A data word $v$ is in the language of $A$ if, in the output $v'$ of the base automaton on $v$, no class is rejected by the class automaton. Let the shuffle language of $B$ be $B' = \neg((\neg B_e \wedge \mathrm{1class}) \shuffle \Sigma^*)$.
Then $L(A) = \tau^{-1}(B')$, where $\tau : A^* \to C^*$, so $\tau^{-1} : C^* \to A^*$. Let us label every transition of $\tau$ with a letter of a new alphabet $\Sigma'$. For example, let a transition $\xrightarrow{c:a}$ be labelled $t$, where $c \in C$, $a \in A$ and $t \in \Sigma'$. Let the functions $g$ and $h$ be $g : t \mapsto c$ and $h : t \mapsto a$.
After this labelling, let the automaton be $T$. Now
$$L(A) = \tau^{-1}(B') = h(T \cap g^{-1}(B')) = f(T' \cap B'')$$
where
$$B'' = \{b_1 t_1 b_2 t_2 \cdots b_m t_m \mid b_1 b_2 \cdots b_m \in B'\}$$
$$T' = \{b_1 t_1 b_2 t_2 \cdots b_m t_m \mid t_1 t_2 \cdots t_m \in L(T) \wedge \forall i.\; g(t_i) = b_i\}$$
$$f : b_i \mapsto \varepsilon, \quad t_i \mapsto h(t_i)$$
Hence the language of a data automaton is expressed by a SSRE + homomorphisms accepting the same language.
Theorem 3. The emptiness and universality problems for SSRE + homomorphisms are undecidable.
Proof. It is known that universality for RA is undecidable [NSV04], and every RA can be expressed as a DA accepting the same language [BDM+11]. So universality of DA is undecidable, and thereby, by Theorem 2, universality of SSRE + homomorphisms is undecidable. Since SSRE + homomorphisms are closed under negation, emptiness of SSRE + homomorphisms is also undecidable.
Let $\Sigma_n = \{a_i, b_i \mid i = 1, \ldots, n\}$ and $L_n$ the language defined by
$$\bigwedge_{i=1}^{n} \forall x.\; b_i(x) \Rightarrow \exists y.\; y < x \wedge a_i(y) \wedge y \sim x \wedge \forall z.\; y < z < x \Rightarrow \neg a_i(z)$$
[How can it be written with two variables only? -el] This language corresponds to the language of a RA with $n$ registers that writes to $reg_i$ while reading $a_i$ and checks against $reg_i$ while reading $b_i$.
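The register reading of this language can be sketched as a short Python membership test. This is an illustration only: it assumes that a letter $a_i$ or $b_i$ is encoded as a pair `("a", i)` or `("b", i)`, that a data word is a list of (letter, datum) pairs, and the name `in_Ln` is hypothetical.

```python
def in_Ln(w):
    """Simulate n registers: a_i writes the current datum to register i,
    b_i checks that register i holds the current datum."""
    registers = {}
    for (kind, i), d in w:
        if kind == "a":
            # Overwriting keeps only the most recent a_i, which matches the
            # requirement that no a_i occurs between the witness y and x.
            registers[i] = d
        elif kind == "b":
            if i not in registers or registers[i] != d:
                return False
    return True


ok = in_Ln([(("a", 1), 5), (("a", 2), 7), (("b", 1), 5), (("b", 2), 7)])
stale = in_Ln([(("a", 1), 5), (("a", 1), 6), (("b", 1), 5)])
```

In the second word the datum 5 was overwritten by a later $a_1$, so the check at $b_1$ fails.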
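Theorem 4. The above language $L_n$ is SSRE definable.
Proof. [I forgot... -el] The expression
$$\forall\mathrm{concatenationfactor}\Big(\bigwedge_{i=1}^{n} \big(w_i.(\Sigma / w_i)^* \implies ((w_i r_i^*) \shuffle (\Sigma / w_i, r_i)^*)\big)\Big)$$
expresses the above language $L_n$, where $\forall\mathrm{concatenationfactor}(e)$ means $\neg(\Sigma^*.\neg e.\Sigma^*)$.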
Let $L_{no}$ be the language of data words with non-overlapping classes, i.e. $L_{no} = \{w \mid [i]_{\sim_w} \text{ is a connected set of positions for all } i = 1, \ldots, |w|\} = \{w \mid w \models \forall x, y.\; x \overset{+1}{\sim} y \Rightarrow x + 1 = y\}$.
Theorem 5. $L_{no}$ is not RA definable, but it is SSRE definable.
Proof. [I forgot... -el]
Suppose an RA with $r$ registers recognizes the language, and consider a word of length $r+2$ whose first $r+1$ positions carry distinct data. The datum at position $r+2$ has to be checked against all $r+1$ previous positions, which is not possible with only $r$ registers. Hence the language cannot be expressed by an RA. The expression
$$\forall\mathrm{shufflefactor}(\mathrm{2classes} \implies \mathrm{1class}.\mathrm{1class})$$
recognizes the above language, where $\forall\mathrm{shufflefactor}(e) = \neg(\neg e \shuffle \Sigma^*)$.
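The shuffle-factor characterization can be compared with the direct definition on small examples. The Python sketch below assumes data words are lists of (symbol, datum) pairs, and reads the expression as "every restriction of the word to two data values is a block of one value followed by a block of the other"; both function names are hypothetical.

```python
from itertools import combinations


def in_Lno(w):
    """Direct definition: every class is a contiguous block of positions."""
    seen = set()
    previous = object()  # sentinel that differs from every datum
    for _, d in w:
        if d != previous:
            if d in seen:
                return False  # d reappears after its block was closed
            seen.add(d)
            previous = d
    return True


def in_Lno_via_shuffle_factors(w):
    """Assumed reading of the expression: each shuffle factor with two
    classes is a concatenation of two one-class words."""
    values = sorted({d for _, d in w}, key=repr)
    for d, e in combinations(values, 2):
        data = [x for _, x in w if x == d or x == e]
        # The factor is in 1class.1class iff the datum changes exactly once.
        changes = sum(1 for p, q in zip(data, data[1:]) if p != q)
        if changes != 1:
            return False
    return True
```

The second function inspects only pairs of data values, which mirrors the fact that the expression quantifies over shuffle factors with exactly two classes.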
Theorem 6. The emptiness of SSRE (without homomorphisms) is undecidable.
Proof. The proof uses the idea of Section 9 of [BDM+11]. It is by reduction from Post's Correspondence Problem (PCP). An instance of PCP is a list of $n$ pairs of words $(u_i, v_i)$ over $\Sigma^*$, and the undecidable question is whether there is a sequence of indices $i_0, i_1, \ldots, i_m \in \{1, 2, \ldots, n\}$ such that $u_{i_0} u_{i_1} \cdots u_{i_m} = v_{i_0} v_{i_1} \cdots v_{i_m}$.
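To make the problem statement concrete, the following Python sketch searches for a PCP solution up to a fixed length. Since PCP is undecidable, such a bounded search can only confirm solutions and never refute their existence in general. The function name `pcp_bounded_search`, the 0-based indices and the sample instance are illustrative assumptions, not part of the reduction.

```python
from itertools import product


def pcp_bounded_search(pairs, max_len):
    """Return the first index sequence of length <= max_len whose u-words
    and v-words concatenate to the same string, or None if there is none."""
    for m in range(1, max_len + 1):
        for seq in product(range(len(pairs)), repeat=m):
            top = "".join(pairs[i][0] for i in seq)
            bottom = "".join(pairs[i][1] for i in seq)
            if top == bottom:
                return seq
    return None


# A small sample instance: pairs (u_i, v_i), indexed from 0.
pairs = [("a", "baa"), ("ab", "aa"), ("bba", "bb")]
solution = pcp_bounded_search(pairs, 4)
```

For this instance the search returns `(2, 1, 2, 0)`: both concatenations spell `bbaabbbaa`.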
Consider a fresh copy $\Sigma'$ of the alphabet, i.e. a map from each letter of $\Sigma$ to a letter of $\Sigma'$, such that $\Sigma \cap \Sigma' = \emptyset$. For a word $w \in \Sigma^*$, the word $w' \in \Sigma'^*$ is obtained by replacing each symbol of $\Sigma$ by the corresponding symbol of $\Sigma'$. Consider the new alphabet $\hat\Sigma = \Sigma \cup \Sigma'$.
Now the sequence $i_0, i_1, \ldots, i_m$ is a solution of PCP iff a data word $w$ with the following properties is accepted by our SSRE.
1. The string projection is of the form $(u_i v'_i)^*$, where $u_i \in \Sigma^*$ and $v'_i \in \Sigma'^*$ is the copy of $v_i \in \Sigma^*$.
2. All the data on letters of $\Sigma$ are distinct, all the data on letters of $\Sigma'$ are distinct, and all the data that appear on $\Sigma$ letters also appear on $\Sigma'$ letters, in the same order.
Now we encode these properties in SSRE without homomorphisms.
1. The language $(u_i v'_i)^*$ is a regular language, hence there exists a regular expression for it.
2. (a) All the data on letters of $\Sigma$ should be distinct; for this consider the expression
$$\neg((\hat\Sigma^*.\Sigma.\hat\Sigma^*.\Sigma.\hat\Sigma^* \wedge \mathrm{1class}) \shuffle \hat\Sigma^*)$$
(b) All the data on letters of $\Sigma'$ should be distinct; for this consider the expression
$$\neg((\hat\Sigma^*.\Sigma'.\hat\Sigma^*.\Sigma'.\hat\Sigma^* \wedge \mathrm{1class}) \shuffle \hat\Sigma^*)$$
(c) For every datum on a letter of $\Sigma$ there is a corresponding position of $\Sigma'$ with the same datum; for this consider the expression
$$\neg((\Sigma \wedge \mathrm{1class}) \shuffle \hat\Sigma^*) \wedge \neg((\Sigma' \wedge \mathrm{1class}) \shuffle \hat\Sigma^*)$$
(d) All the data on $\Sigma$ letters should occur in the same order on $\Sigma'$ letters; for this consider the expression
$$\forall\mathrm{shufflefactor}(\mathrm{2classes} \implies \neg(\hat\Sigma.(\hat\Sigma^2 \wedge \mathrm{1class}).\hat\Sigma))$$
Hence emptiness of SSRE without homomorphisms is undecidable.
4 Restricted Negation
A SSRE is positive if $\shuffle$ does not occur underneath a negation.
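Theorem 7. If $e$ is a positive SSRE, then $L(e)$ is $\mathrm{EMSO}^2(<, \sim, +1, \overset{+1}{\sim})$ definable.
Proof. [TODO, this is only a sketch of the proof -el] Let $\varphi_e(X)$ be the $\mathrm{EMSO}^2(<, \sim, +1, \overset{+1}{\sim})$ formula with $fv(\varphi_e) = \{X\}$ defined by induction on $e$ as follows:
• ...
Then for all $\hat X \subseteq \{1, \ldots, |w|\}$, it holds that $w, X \mapsto \hat X \models \varphi_e$ iff $w|_{\hat X} \in L(e)$, where $w|_{\hat X} = \binom{a_{i_1}}{d_{i_1}} \cdots \binom{a_{i_k}}{d_{i_k}}$ if $\hat X = \{i_1, \ldots, i_k\}$ with $i_1 < \cdots < i_k$. Finally, $\exists X.\forall x.\; x \in X \wedge \varphi_e(X)$ defines $L(e)$.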
5 Conjectures
Let the language $L_{FL}$ be defined by $\exists x, y\,(x \not\sim y \wedge (\forall y.\; x < y \vee x = y) \wedge (\forall x.\; x < y \vee x = y))$, which expresses that the first and last positions carry different data.
Conjecture 1. The language $L_{FL}$ cannot be represented by a SSRE (without homomorphisms).
Conjecture 2. RA, DA and $\mathrm{FO}^2(+1, <, \sim)$ are not comparable with SSRE (without homomorphisms) in terms of the languages of data words they can express.
Proof. The above-mentioned language $L_{FL}$ can be represented using RA, $\mathrm{FO}^2(+1, <, \sim)$ and DA.
• As the definition is expressed in $\mathrm{FO}^2(+1, <, \sim)$, the language is $\mathrm{FO}^2(+1, <, \sim)$ definable.
• It is RA expressible: the automaton stores the datum of the first position, nondeterministically guesses the last position, and performs a read operation on the same register there.
• It is DA expressible: the transducer guesses the first and last positions and relabels them with #, keeping the rest unchanged, and the class automaton checks whether the class string is in $(\Sigma^* + \#.\Sigma^*.\#)$.
By Conjecture 1, the language $L_{FL}$ cannot be represented by a SSRE (without homomorphisms).
The language $\forall x, y.\; x \neq y \Rightarrow x \not\sim y$, describing that all data are distinct, can be expressed using a SSRE (without homomorphisms) but cannot be represented using an RA.
• Suppose the language can be represented using an RA with $r$ registers. Consider a word of length $r+2$ whose first $r+1$ positions carry distinct data. The datum at position $r+2$ has to be checked against all $r+1$ previous positions, which is not possible with only $r$ registers. Hence a contradiction.
• The expression
$$\neg\Big(\Sigma^* \shuffle \big(\mathrm{1class} \wedge \neg\varepsilon \wedge \neg\bigvee_{a \in \Sigma} a\big)\Big)$$
represents the above language.
The language in which all positions carry the same datum and the length of the data word is even cannot be represented in $\mathrm{FO}^2(+1, <, \sim)$ but can be represented using a SSRE (without homomorphisms).
• The expression
$$\mathrm{1class} \wedge (\Sigma\Sigma)^*$$
represents the above-mentioned language.
• The above-mentioned language cannot be represented in $\mathrm{FO}^2(+1, <, \sim)$. This can be shown by playing the Ehrenfeucht-Fraïssé game on the families of words $w_n = \binom{a}{d}^{2^n}$ and $w'_n = \binom{a}{d}^{2^n+1}$.
Let $L'$ be the language of data words satisfying the following properties [BDM+11]:
1. The string projection is of the form $a^* b a^*$.
2. The data value at the position of $b$ is distinct from the data at all other positions.
3. Every other data value occurs exactly twice, once before $b$ and once after it.
4. The order of the data before $b$ is not exactly the same as the order of the data after $b$.
The language $L'$ can be represented using a DA. The letter-to-letter transducer relabels all the a's before b with #, except that it marks two positions, labelling the first x and the next y; it maps b to b; and it relabels all the a's after b with #, except that it marks two positions, labelling the first x and the next y. The class automaton accepts the language
$$b + \#\# + xy + yx$$
[BDM+11]
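The language $L'$ can be represented using a SSRE (without homomorphisms). The expression
$$a^* b a^* \wedge \neg((\Sigma^+.b.\Sigma^* \wedge \mathrm{1class}) \shuffle \Sigma^*) \wedge \neg((\Sigma^*.b.\Sigma^+ \wedge \mathrm{1class}) \shuffle \Sigma^*)$$
$$\wedge\; \neg((\Sigma^*.a.\Sigma^*.a.\Sigma^*.b.\Sigma^* \wedge \mathrm{2classes}) \shuffle \Sigma^*) \wedge \neg((\Sigma^*.b.\Sigma^*.a.\Sigma^*.a.\Sigma^* \wedge \mathrm{2classes}) \shuffle \Sigma^*)$$
$$\wedge\; \neg(a \shuffle \Sigma^*) \wedge ((a.(aa \wedge \mathrm{1class}).a \wedge \mathrm{2classes}) \shuffle \Sigma^*)$$
expresses the language $L'$.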
It can be shown that $\neg L'$ cannot be represented by a DA [BDM+11]. But SSRE are closed under negation, hence $\neg L'$ can be represented using a SSRE (without homomorphisms).
Conjecture 3. The language $\{\binom{a_1}{d_1} \cdots \binom{a_n}{d_n} \mid d_1 = d_n\}$ is not SSRE definable. However, it is RA definable with one register.
Open question: can SSRE (without homomorphisms) express all languages defined by deterministic RA?
References
[BDM+11] Mikołaj Bojańczyk, Claire David, Anca Muscholl, Thomas
Schwentick, and Luc Segoufin. Two-variable logic on data words.
ACM Trans. Comput. Log., 12(4):27, 2011.
[BS07] Henrik Björklund and Thomas Schwentick. On notions of regularity
for data languages. In Fundamentals of Computation Theory, 16th
International Symposium, FCT 2007, Budapest, Hungary, August
27-30, 2007, Proceedings, pages 88–99, 2007.
[GC11] Murdoch James Gabbay and Vincenzo Ciancia. Freshness and
name-restriction in sets of traces with names. In FOSSACS 2011,
pages 365–380, 2011.
[HMSW11] Tony Hoare, Bernhard Möller, Georg Struth, and Ian Wehrman.
Concurrent Kleene algebra and its foundations. The Journal of Logic
and Algebraic Programming, 80(6):266 – 296, 2011. Relations and
Kleene Algebras in Computer Science.
[KF94] Michael Kaminski and Nissim Francez. Finite-memory automata.
Theor. Comput. Sci., 134(2):329–363, 1994.
[KMS15] Dexter Kozen, Konstantinos Mamouras, and Alexandra Silva.
Completeness and incompleteness in nominal Kleene algebra. In
Relational and Algebraic Methods in Computer Science - 15th
International Conference, RAMiCS 2015, Braga, Portugal, September 28 -
October 1, 2015, Proceedings, pages 51–66, 2015.
[Koz02] Dexter Kozen. On the complexity of reasoning in Kleene algebra.
Inf. Comput., 179(2):152–162, 2002.
[NSV04] Frank Neven, Thomas Schwentick, and Victor Vianu. Finite state
machines for strings over infinite alphabets. ACM Trans. Comput.
Logic, 5(3):403–435, July 2004.