Collcom2005 agent basedft

Dynamic Network Reconfiguration in Presence of Multiple Node and Link
Failures Using Autonomous Agents
Juan Ramón Acosta and Dimiter R. Avresky
Network Computing Lab, Northeastern University, Boston, MA
{jracosta,avresky}@ece.neu.edu
Abstract
Currently, high-speed networks are indispensable commodities for
all users and they have become an integral part of their lifestyles.
For this reason, it is necessary for the network to be available most
of the time and to achieve transparent network failure recovery. In
this paper, it is proposed to use Agent NetReconf¹, an agent-based
dynamic network reconfiguration algorithm that is capable of tol-
erating multiple router and link failures in high-speed networks
with arbitrary topology. Agent NetReconf updates the routing ta-
bles asynchronously and does not require any global knowledge
of the network topology. Agent NetReconf uses mobile and au-
tonomous agents to detect and recover the network from failures.
Agent NetReconf highlights the benefits of using smart networking
devices as a means of building an active network. The complexity
of Agent NetReconf is analyzed and the termination, liveliness and
safety are proved.
Keywords: high-speed networks, autonomous mobile agents, dy-
namic reconfiguration, fault tolerance, adaptive routing, arbitrary
topologies
Introduction
The increasing number of users of the Internet has trig-
gered a significant growth in the number of networked de-
vices and the traffic they generate. Computer networks are
now being pushed to their limit. In this context, computing
capacity is available but it can be severely affected by fail-
ures. The major challenge faced by service providers today
is to keep their ability to give customers the level of ser-
vice they require, regardless of system conditions and the
number of faults on the network.
The need to provide increased availability has led
researchers such as Hood and Ji [8] to develop a sophisticated
intelligent software agent that performs fault detection
accurately and in certain cases predicts the fault before
it appears. Others such as White et al. [15] have implemented
communities of mobile agents that roam the network
collecting and exchanging network information based
on the “social insects” paradigm (ant behavior) described
by Schoonderwoerd et al. [11].
¹ This work was supported by the U.S. National Science Foundation
under grant CCR-0004515.
In this paper, an algorithm is proposed for achieving dy-
namic network fault detection and avoidance in arbitrary
topologies using autonomous agents running at each router.
The reconfiguration algorithm is distributed and embedded
in the agents’ behavior. The paper is organized in six
sections as follows: Section 1 presents an overview of agents
and how they are used in adaptive routing. Section 2
describes a new router architecture that uses autonomous
agents for its routing services. Section 3 describes Agent
NetReconf and how it reconfigures the routing tables to
restore routing capabilities in the network segment affected by
the failure. Section 4 presents the complexity, termination,
safety and cognitive properties of Agent NetReconf. Section 5
presents a fault recovery example showcasing the execution
of the algorithm. The last section contains the conclusions.
1. Autonomous Agents
This section presents an overview of previous work that
has been published on how agents are used to achieve effi-
cient network routing and fault tolerance.
The term agent has been used to refer to a software
and/or hardware component which is capable of acting ex-
actingly in order to accomplish tasks on behalf of its user
[10]. An agent is able to cooperate with other agents, learn
from its environment [17], and sometimes migrate under its
own control from one machine to another, provided both
computers are part of a network.
Agents communicate with other agents to successfully achieve
all the tasks given to them [16]. Communication
between agents is modeled as a point-to-point exchange of
messages whose content is expressed in a well-defined
language, for example the Knowledge Query and Manipulation
Language (KQML) [4], the Knowledge Interchange
Format (KIF) [14] or, most recently, the OWL Web Ontology
Language [2].
1.1. Applications on Network Fault Tolerance
Minar in [9], describes an algorithm to discover the net-
work topology using mobile agents. The agents travel the
network and from each node they visit they learn its cur-
rent connectivity. In addition, the agents complement the
acquired knowledge by cooperating with other agents they
meet at the same node. Finally, when agents finish explor-
ing the network, the topology is fully discovered, and this
information is then used to define the routing tables at each
node. Agents have also been used in adaptive routing. For
example, Gianni in [3] introduced a distributed adaptive
routing algorithm based on mobile agents that is capable of
learning the routing tables of a computer network using the
ant colony metaphor. Garijo, Cancer and Sánches in [6]
describe a centralized Multi-agent Cooperative
Network-Fault Management system (CNFM) that uses
ISO standard interfaces at each router to detect and avoid
faults on the network. In CNFM the agents work as
watchdogs of the network, monitoring each element and
generating events into the CNFM engine when faults are
recognized.
Cynthia Hood and Chuanyi Ji [8] took advantage of
the increasingly available computation power in networking
devices and the benefits of artificial intelligence to design
an intelligent agent that processes information collected by
the Simple Network Management Protocol agents (SNMP-
agents) at each node and uses this information to detect net-
work anomalies that typically precede a fault. “The intel-
ligent agent learns the normal behavior from each reading
made by the SNMP-agent and combines the information us-
ing a Bayesian network that could trigger a local corrective
action or a message to a centralized network manager.” In
a similar approach presented by Phuan and Yufang in [19],
an intelligent mobile agent has the capability to extract data
from a network element using a local high-bandwidth
communication session, which avoids consuming network
resources and reduces the overall communication traffic. The
intelligent mobile agent has the ability to integrate knowledge
from a network manager and any network element to
infer which type of fault recovery needs to be
performed.
The algorithm proposed in this paper is different from
the solutions described earlier in that Agent NetReconf ex-
ecutes network failure recovery using only the local knowl-
edge at each router without having to know the network
topology or the type of faulty element (router or link), and
it is platform independent.
2. Agent Based Router
In order for network failure recovery to happen at the
exact location where an element failed, it is necessary that the
routing elements in the vicinity take an active role in the
detection and containment of the fault. As mentioned earlier,
network fault detection and recovery is commonly
implemented such that a central network monitoring
station launches all the corrective actions from a remote site,
as seen in [8, 6, 19]. Only a few implementations, such
as those described in [1, 5], make the routers adjacent to the
failure participate in the restoration of connectivity.
The authors, in this section, propose an agent based
router in which the detection and reconfiguration tasks are
performed by a group of intelligent agents. The agents are
goal oriented and capable of incorporating new knowledge
learned during the router operation and network reconfigu-
ration.
In essence, the new router is an active intelligent network
device capable of reacting and adjusting its operation based
on the events that occur in its internal and external environ-
ment.
2.1. Architecture
The architecture of the new intelligent router, in Figure
1, is based on a high-speed cross bar switch with an en-
hanced embedded software module that contains an agent
subsystem. For simplicity, the agent platform will not be
specified.
The router hosts a community of agents that are responsible
for controlling the router’s activities and coordinating all
the tasks involved in the dynamic reconfiguration of
routing tables when the router participates in the recovery of
a failure. The knowledge used by the agents to represent
the router, links, neighbors and the execution parameters of
the fault-tolerant reconfiguration algorithm is saved in the
agent’s main memory. The structural representation of the
knowledge is defined using ontology classes written in the
OWL Web Ontology Language [2].
The definition of the agents operating the router is as fol-
lows:
1. Node Manager Agent. This agent oversees the operation
of the router and the other agents. The node manager
is the router’s public interface that can be used by
network administration tools, visiting explorer agents,
neighbor routers and other external network elements
to communicate with the router. The manager agent
is also responsible for the security and integrity of the
router; it supervises all access to the routing
tables and memory, and makes sure that all the requests
made to it are safe. The node manager agent is the
only component in the router that can initiate a
reconfiguration task. The node manager agent uses a
reinforcement learning method to acquire new knowledge
to make better decisions during node management and
fault recovery.
Figure 1. Agent based router architecture (input ports 0 to N−1,
output ports 0 to N−1, N×N crossbar, arbitration, routing
decision, routing tables, Node Manager Agent, Router Agent,
Link Manager Agent)
2. Router Agent. It is the only agent in the new architec-
ture that can manipulate the routing tables and has the
capability of accepting or declining updates. The agent
behavior is determined by the inherent routing algo-
rithm and the dynamic reconfiguration policies. As
seen in Figure 1, the router’s arbitration and routing
decision logic are controlled by this agent. The router
agent reacts only to requests from the node manager
agent.
3. Link Manager Agent. Responsible for managing the
router’s connected links, ports and queues. The agent
is in charge of detecting and reporting failures and
congestion to the node manager. The agent uses a
reinforcement learning model to learn the characteristic
symptoms that precede a failure or congestion,
which allows it to choose the appropriate
corrective actions and promptly trigger a restoration task.
The agent uses the “I’m alive” message model to
detect failures and the flow-unaware statistical
delay method described in [13] to accurately determine
packet delays without depending on the dynamic
information of the packet flow.
4. Explorer Agent. These agents are dynamically cre-
ated in each router when Agent NetReconf is executed.
When an explorer agent is working in search mode
it cooperates with other agents to build a restoration
spanning tree that will re-connect the nodes discon-
nected by the failure. When an explorer agent is work-
ing in restoration mode, it collaborates with the node
manager agents at each router on the restoration tree
to update the local router tables. An explorer agent is
a delegate of the router that created it, such that any
interaction between two different agents is equivalent
to the two routers interacting directly point-to-point.
3. Network Failure Recovery
3.1. Agent NetReconf
This section describes a new dynamic network reconfig-
uration algorithm Agent NetReconf. The algorithm uses a
set of collaborative agents to restore network connectivity
after a failure is detected. Agent NetReconf is a distributed
intelligent algorithm that operates at the network level with-
out any global information of the network topology.
The strategy used by Agent NetReconf consists in iden-
tifying the set of nodes adjacent to a failure and from them
selecting a leader to coordinate the construction of a restora-
tion spanning tree and synchronize the updates to the rout-
ing tables at each node on the restoration tree.
The complete reconfiguration process consists of four
phases: Leader Selection, Restoration Tree Construction,
Reconfiguration Synchronization and Tables Update. The
correct execution of these phases is subject to the validity
of the following assumptions:
Assumption 3.1 After a failure F is detected, no additional
failures will occur on any link or node that belongs to the
restoration tree, until Agent NetReconf finishes the recon-
figuration process for F.
Assumption 3.2 The network is not partitioned as a result of
the failures.
Before describing in detail each phase, for clarity, con-
sider R to be the set of all routers in the network and that
each router Ri is connected to N other routers, its imme-
diate neighbors. Also let Sij be the collection of IDs of all
routers that are two hops away from Ri via link Lj. Addi-
tionally, assume that each Lk is monitored and managed by
one of the link manager agents (LMk). At each router Ri,
the link manager LMk that detects missing “I’m alive” mes-
sages from link Lk, immediately notifies the Node Manager
Agent (NMi) by raising the asynchronous NetworkFailure-
Detected event.
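The detection step above can be illustrated with a short Python sketch of a link manager that tracks “I’m alive” messages. This is a minimal sketch under stated assumptions, not the authors’ implementation: the class name LinkManager, the timeout parameter and the on_failure callback (which stands in for raising the NetworkFailureDetected event towards the node manager) are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class LinkManager:
    """Hypothetical link manager that watches "I'm alive" messages per link."""
    timeout: float                       # assumed: seconds of silence tolerated
    on_failure: Callable[[str], None]    # stands in for NetworkFailureDetected
    last_seen: Dict[str, float] = field(default_factory=dict)
    failed: Set[str] = field(default_factory=set)

    def heartbeat(self, link: str, now: float) -> None:
        """Record that an "I'm alive" message arrived on the given link."""
        self.last_seen[link] = now

    def check(self, now: float) -> None:
        """Report every link whose last message is older than the timeout."""
        for link, seen in self.last_seen.items():
            if link not in self.failed and now - seen > self.timeout:
                self.failed.add(link)
                self.on_failure(link)    # notify the node manager once per link


events = []
lm = LinkManager(timeout=3.0, on_failure=events.append)
lm.heartbeat("L0", now=0.0)
lm.heartbeat("L1", now=0.0)
lm.heartbeat("L0", now=2.5)   # L0 keeps sending, L1 goes silent
lm.check(now=4.0)
print(events)
```

In this usage only L1 is reported, because L0 was heard from within the assumed 3-second window.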
Leader Selection After the failure is detected by router Ri,
the node manager NMi suspends the traffic targeting Lk,
the link leading to the presumed faulty node. From Sik,
NMi selects the ID with the highest value and records it in
memory as the ID corresponding to the Restoration Leader
(RLF ). If the selected ID equals Ri’s ID then Ri becomes
the leader and immediately starts Phase 1. Otherwise, when
the selected ID does not match Ri’s, the router starts timer
Tstart and waits for a control signal from RLF that indicates
that the node can join Phase 1. If Tstart times out and
no signal from RLF was received, Ri marks RLF faulty
and starts the leader selection again.
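The leader selection rule can be sketched in a few lines of Python. The sketch assumes that the candidate set is the router’s own ID together with the IDs in Sik (as the proof of Lemma 4.1 suggests), and that a leader that stays silent until Tstart expires is simply removed from the candidates before selecting again. The function name select_leader and its parameters are hypothetical.

```python
def select_leader(own_id, two_hop_ids, suspected_faulty=frozenset()):
    """Return the highest ID among this router and its two-hop neighbors
    through the failed link, ignoring IDs already marked faulty."""
    candidates = ({own_id} | set(two_hop_ids)) - set(suspected_faulty)
    return max(candidates)   # own_id is never marked faulty, so never empty


# Routers 3, 5 and 7 were all attached to the failed element.
print(select_leader(3, {5, 7}))        # router 3 picks 7 and waits for it
print(select_leader(7, {3, 5}))        # router 7 picks itself and leads
print(select_leader(3, {5, 7}, {7}))   # Tstart expired: 7 marked faulty
```

Because every router adjacent to the failure evaluates the same set of IDs, each one arrives at the same leader without exchanging messages.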
Definition 3.1 Node Adjacent to Failure (NAF) It is a node
that was not selected “Restoration Leader” and was di-
rectly connected to a node or link that failed.
Phase 1. Restoration Tree Construction The first step in
Agent NetReconf is to build a restoration tree to establish a
communication path between the leader and the NAFs.
Step 1a. Begin Phase
Phase 1 starts with the Restoration Leader RLF (RLF = Ri)
creating one explorer agent Eij per active
link Lj. Eij is initialized in search mode and is provided
with the list of disconnected NAFs. Eij makes Ri its home
and starts the search for NAFs by migrating to the neighbor
connected to Lj.
After all Eij have migrated out of the leader node, RLF starts
timer Tack and waits for the arrival of control signals con-
firming that a restoration path was found between RLF and
each NAF.
Step 1b. Searching for NAFs
As the explorer agent Eij arrives at a node Rx, it adds the
ID of the visited node to the restoration path it is building.
Eij exchanges information with the current node and uses
this information to define an itinerary for its next migration.
If the explorer agent did not arrive at a NAF, then it
uses the information to create clones of itself to help it con-
tinue searching. The itinerary and the number of clones are
based on the number of active links and the available feasi-
ble routes to the NAFs. For example, in Figure 2, explorer
EH3 learns from RE that there are two active links L0 and
L3, and one feasible route via L3. NAFs {C,D} are pre-
sumed to be reachable through L3 and {A,B} will need to
be searched via L0. This implies that at least two clones are
required. However, since RE is not a NAF then EH3 can
continue searching. Therefore, only one clone is required
for the next migration.
When Eij arrives at a NAF, the explorer agent removes
Rx from the list of NAFs and tells NMx to save the
restoration path Eij traveled. Then, NMx stops Tstart and creates
a restoration explorer agent ERxi that it sends back to
the restoration leader Ri to confirm that the restoration path
was found. Although Eij reached a NAF, the search needs
to continue for the remaining NAFs in the list. Eij then
creates clones and their itineraries following the same
criteria mentioned before. Each clone then continues the search.
Meanwhile Eij stays at Rx and starts timer Tphase3 to wait
for a signal from RLF to start Phase 3. The case in which
Tphase3 times out represents a situation in which a failure
occurs during reconfiguration. However, based on
Assumption 3.1, this will not occur.
Cycles are prevented in the restoration paths by deactivating
an explorer agent when it arrives at a node that has
already been visited by itself, one of its clones or one of its
siblings.
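The search and the cycle-prevention rule can be illustrated with a small simulation. This is a centralized Python sketch under simplifying assumptions, not the distributed agent code: explorers are modeled as paths in a queue, migrations happen in lock step rather than asynchronously, and clones are sent over every active link instead of following the itinerary heuristics described above. The name search_nafs and the example topology are hypothetical.

```python
from collections import deque


def search_nafs(links, leader, nafs):
    """Simulate explorer agents leaving the leader over its active links.

    links maps a node ID to the IDs of its neighbors over active links.
    Returns {naf: restoration path from the leader to that NAF}.
    """
    visited = {leader}
    paths = {}
    frontier = deque([[leader]])           # one entry per active explorer
    while frontier and len(paths) < len(nafs):
        path = frontier.popleft()
        for neighbor in sorted(links.get(path[-1], ())):
            if neighbor in visited:        # explorer deactivates: no cycles
                continue
            visited.add(neighbor)
            new_path = path + [neighbor]
            if neighbor in nafs:
                paths[neighbor] = new_path  # the NAF saves the path
            frontier.append(new_path)       # clones search for other NAFs
    return paths


# Assumed topology after the failure; leader is 9, NAFs are 4 and 6.
links = {9: [1], 1: [9, 2, 4], 2: [1, 6], 4: [1], 6: [2]}
print(search_nafs(links, leader=9, nafs={4, 6}))
```

Since every node accepts only the first explorer that reaches it, the recorded paths share prefixes and never loop back.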
In order to distinguish between node and link failures,
Agent NetReconf uses explorer agents as follows: if a NAF
receives an Eij from a node that is assumed to be faulty,
then a link failure is identified, and the NAF must
update its reconfiguration information for the node and mark it
safe. In the case of a link failure, the two nodes at the ends
of the faulty link may each have determined that they are the
restoration leader, so it is necessary to synchronize the
nodes such that only one leader remains. The synchronization
occurs when both nodes receive an explorer agent
from each other, Eij and Eyj. The restoration leader for
the faulty link will be the parent of the explorer agent that
has the highest ID value. For example, Ry, parent of Eyj,
becomes the restoration leader for the failed link and node
Ri becomes a NAF. After the leader synchronization has
occurred, Agent NetReconf continues with the
reconfiguration.
Step 1c Establishing Tree
At each node Rj that is on the path followed by ERxi,
NMj marks the links on which ERxi arrives and departs
as members of the restoration tree. Furthermore, if NMj
detects that a different ERxy, from leader Ry, has already
visited the node, then to avoid any conflicts with the
reconfiguration, it gives ERxi the information about Ry, such
that when ERxi gets to Ri, Ri can synchronize with Ry before
it proceeds with Phase 3. ERxi continues migrating until it
reaches the restoration leader.
When Tack times out at the restoration leader, RLF de-
termines which NAF did not reply with an ERxi in order to
mark it faulty and exclude it from the reconfiguration. The
restoration leader continues and builds the restoration tree
by merging each root of the confirmed restoration paths.
After the restoration tree is completed, each ERxi sends a
point-to-point Restoration Tree Built (RTB) message signal
to its parent.
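The merging of confirmed restoration paths into a tree can be sketched as follows. This Python fragment is only an illustration: it assumes each confirmed path is the list of node IDs from the leader to a NAF, and that the paths share prefixes (as they do when every node accepts only the first explorer), so that their union contains no cycle. The function names are hypothetical.

```python
def build_restoration_tree(confirmed_paths):
    """Merge leader-to-NAF paths into a set of undirected tree links."""
    tree_links = set()
    for path in confirmed_paths.values():
        for a, b in zip(path, path[1:]):
            tree_links.add(frozenset((a, b)))   # link between adjacent hops
    return tree_links


def unconfirmed_nafs(expected_nafs, confirmed_paths):
    """NAFs that sent no restoration explorer before Tack expired."""
    return set(expected_nafs) - set(confirmed_paths)


paths = {4: [9, 1, 4], 6: [9, 1, 2, 6]}
tree = build_restoration_tree(paths)
print(sorted(sorted(link) for link in tree))
print(unconfirmed_nafs({4, 6, 8}, paths))   # 8 would be marked faulty
```

The shared link between nodes 9 and 1 appears once in the tree even though both paths use it.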
Definition 3.2 Node On Restoration Tree (NORT)
It is a node that has at least one link belonging to the
restoration tree.
Phase 2. Multiple Failure Synchronization
When multiple failures appear, Agent NetReconf establishes
an ordered sequence of priorities between the restoration
leaders detected by the visited NORTs, such that the
reconfigurations occur in a “safe” sequence in which the
restoration leader with the highest ID always executes Phase
3 first, while the others await their turn. For example, if
we assume that Ry’s ID is higher than Ri’s, then Ry will
proceed to Phase 3 before Ri.
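The priority rule of Phase 2 amounts to a simple test that each restoration leader can evaluate locally. The Python sketch below assumes that a leader knows the IDs of the leaders whose restoration trees intersect its own and keeps a set of those that have already announced completion of Phase 3; the function name and its parameters are hypothetical.

```python
def may_enter_phase3(leader_id, intersecting_leaders, completed):
    """A leader may start Phase 3 once every intersecting leader with a
    higher ID has finished its own Phase 3."""
    return all(other in completed
               for other in intersecting_leaders
               if other > leader_id)


# Leaders 12 and 7 have intersecting restoration trees.
print(may_enter_phase3(12, {7}, completed=set()))    # highest ID goes first
print(may_enter_phase3(7, {12}, completed=set()))    # 7 must wait
print(may_enter_phase3(7, {12}, completed={12}))     # 12 is done, 7 proceeds
```

A leader with no intersecting trees passes the test immediately, which corresponds to the case in which Phase 2 is skipped.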

Phase 3. Routing Information Update
This phase starts with a NAF processing an incoming
RTB message and providing new routing information to the
awaiting Eij. After the information exchange finishes the
explorer agent starts migrating back to RLF using the ac-
knowledged restoration path. As Eij travels back to RLF ,
the node manager of a visited node exchanges routing in-
formation with Eij and if necessary it updates its rout-
ing tables. Eij continues migrating until it reaches RLF .
The information given to Eij by the NAF, and each visited
node, includes the IDs of all destinations that are reachable
through each of these nodes using links that do not belong
to the restoration tree.
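The information exchange described above can be sketched with two small Python helpers. The representation is an assumption made for illustration: a routing table is a dictionary from destination ID to outgoing link name, and “updates if necessary” is interpreted as adding unknown destinations and replacing routes that still point to a failed link. Neither the data layout nor the function names come from the algorithm specification.

```python
def reachable_off_tree(routing_table, tree_links):
    """Destinations this node reaches over links outside the restoration tree."""
    return {dest for dest, link in routing_table.items()
            if link not in tree_links}


def apply_update(routing_table, failed_links, via_link, advertised):
    """Return a copy of the table that routes newly learned destinations,
    and destinations stranded on a failed link, through via_link."""
    updated = dict(routing_table)
    for dest in advertised:
        if dest not in updated or updated[dest] in failed_links:
            updated[dest] = via_link
    return updated


table = {"A": "L0", "B": "L1", "C": "L2"}
print(sorted(reachable_off_tree(table, tree_links={"L1"})))

stale = {"A": "L0", "D": "L3"}            # D was reached over failed link L3
print(apply_update(stale, {"L3"}, "L1", {"D", "E"}))
```

In the second call the route to D moves from the failed link L3 to the tree link L1, the new destination E is added, and the healthy route to A is left untouched.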
Upon arrival at the restoration leader, Eij delivers to
RLF the routing information it collected. RLF processes
the data to adjust its routing tables and deactivates Eij.
When RLF completes the update, it provides each
ERxi with the IDs of all the destinations reachable through
its active links, excluding the link on which the ERxi arrived,
and then ERxi migrates to its parent NAF. After all
restoration explorers have migrated, RLF starts a timer Tcomplete
to wait for a confirmation signal from each NAF indicating
that the updates were completed and that they are ready to
resume operations. The case in which Tcomplete times out
is similar to the timeout case described earlier and will
be dealt with in a future publication.
As ERxi travels back, the node manager of a visited
node exchanges routing information with ERxi and if nec-
essary it updates its routing tables. ERxi continues travel-
ing until it reaches its parent NAF. The routing information
provided by the visited node includes the IDs of all the des-
tinations reachable through the visited node using the links
that belong to the restoration tree with the exception of the
IDs of nodes accessible via the links on which ERxi arrives
and leaves the visited node.
Upon arrival of ERxi, Rx updates its routing tables
with the information contained in the restoration explorer
and ERxi is deactivated. The NAF then sends RLF a point-to-point
Update Complete Response (UCR) message signal. When
RLF receives the UCR signal, it stops Tcomplete and
resumes normal operations.
The reconfiguration algorithm, as described, makes full use
of the ability of the agents to interact with each
other. Communication between the explorer and node manager
agents is performed mostly within the router’s agent
module; only a few exchanges leave the router, and those
happen in a point-to-point form. This is an important
contribution of Agent NetReconf because it keeps the algorithm
execution distributed at each router and keeps to a minimum the
bandwidth overhead and the number of links
preempted for the reconfiguration to work.
Agent NetReconf bases its execution on the natural abil-
ity of autonomous agents to acquire and share knowledge,
for instance, when the explorer agents are searching for
NAFs they learn information at each node that helps them
design an optimal migration pattern that reduces network
flooding significantly.
4. Properties of Agent NetReconf
4.1. Complexity
The complexity of Agent NetReconf is analyzed in terms
of the number of explorer agents created during restora-
tion tree construction and routing table reconfiguration. Let
LActive be the number of active links on each router, nfin
the number of NAFs for failure F, and P a path between
RLF and a NAF.
Theorem 4.1 The complexity for Agent NetReconf for mul-
tiple failures is given by
O(LActive ∗ ((nmax ∗ Pmax) + 1))
where Pmax is the longest path connecting RLF and any
NAF and nmax is the maximum number of NAFs.
Proof: Agent NetReconf determines RLF without cre-
ating explorer agents such that leader selection is achieved
with O(0) complexity.
In Phase 1, when the recovery step initiates, RLF creates
LActive exploration agents Eij, one per active link. The
corresponding complexity for this operation is O(LActive).
As an explorer agent migrates searching for target NAFs,
the maximum number of explorer agents created at the
visited node Rx, as described in Phase 1, is LActive − 1. In
cases where Rx is a NAF, Rx creates one exploration agent
for recovery, ERxi, such that the maximum number of
explorer agents created at an intermediate router is LActive.
Now, considering that the longest restoration path be-
tween RLF and a NAF is Pmax, the total number of ex-
plorer agents needed to continue searching for a NAF is
Pmax ∗ LActive. Considering the worst case in which each
NAF is reached via a disjoint restoration path, the total num-
ber of explorers created is given by nmax ∗ Pmax ∗ LActive.
Assuming that all restoration trees intersect, Phase 2 is
executed independently for each RT without creating any
agents, which results in O(0) complexity.
Then, by adding the number of agents created by
the restoration leader, the complexity of Agent NetReconf
becomes O((nmax ∗ Pmax ∗ LActive) + LActive), or
O(LActive ∗ ((nmax ∗ Pmax) + 1)). Q.E.D.
Now, by comparing O(LActive ∗ ((nmax ∗ Pmax) + 1))
with the complexity of NetRec in [1], which is O(N ∗ (L +
nmax ∗Pmax +N ∗Pmax)), it is clear that Agent NetReconf
reduces the complexity of NetRec by more than one order
of magnitude. This is explainable by the fact that LActive is
expressed in terms of the number of active links instead of
the total number of links in the network. The improvement
presented here is possible because in Agent NetReconf the
agents are using their knowledge to make inferences and
execute actions that otherwise, in standard NetRec, would
require several point-to-point message exchanges. This, in
fact, is a powerful feature of agent based systems as is men-
tioned in [14].
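To give a feel for the two expressions, the following Python sketch evaluates them for one set of sample values. The numbers are arbitrary illustrations, not measurements, and the reading of N as the number of nodes and L as the number of links in the NetRec expression is an assumption. Since both expressions are asymptotic bounds, the resulting figures ignore constant factors and only indicate relative growth.

```python
def agent_netreconf_bound(l_active, n_max, p_max):
    """Evaluate LActive * ((nmax * Pmax) + 1) from Theorem 4.1."""
    return l_active * (n_max * p_max + 1)


def netrec_bound(n_nodes, n_links, n_max, p_max):
    """Evaluate N * (L + nmax * Pmax + N * Pmax), the NetRec expression,
    assuming N counts nodes and L counts links."""
    return n_nodes * (n_links + n_max * p_max + n_nodes * p_max)


# Assumed sample network: 20 nodes, 40 links, 4 active links per router,
# at most 3 NAFs, longest restoration path of 5 hops.
agent = agent_netreconf_bound(l_active=4, n_max=3, p_max=5)
netrec = netrec_bound(n_nodes=20, n_links=40, n_max=3, p_max=5)
print(agent, netrec)
```

For these sample values the first expression evaluates to 64 and the second to 3100.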
4.2. Termination
The following agent migration patterns and message de-
livery properties are used for proving Agent NetReconf’s
Termination.
Definition 4.1 If a point-to-point message is sent from a
source agent S to a destination agent D, then it will be re-
ceived once and only once by D.
Definition 4.2 Every point-to-point message sent between
an exploration agent Eij or ERxi and a node manager
agent NMx will be routed following a path on the restora-
tion tree and will be reliably delivered to its destination.
Definition 4.3 The restoration leader RLF considers an
arriving ERxi to be the acknowledgment sent from a NAF
to confirm that a restoration path has been created.
Definition 4.4 The restoration leader RLF considers a re-
turning Eij to be the acknowledgment sent by a NAF to con-
firm that a restoration tree was established and the request
to update its routing tables with the information carried by
Eij.
Definition 4.5 A NAF considers a returning ERxi to be the
acknowledgment sent by the RLF that it updated its routing
information and that the NAF must update its table with the
new information carried by ERxi.
Lemma 4.1 For a given faulty node F, all NAFs will elect
the same RL.
Proof: We prove by contradiction. Suppose that two
NAFs will elect different RLs. Since the router with highest
ID among the NAFs is elected for RL, then these two NAFs
must have used different NAF sets. However, all NAFs are
two hops from each other through F and by definition each
NAF knows its own ID and the IDs of all routers that are
two hops away from it. Thus, the NAF sets determined by
the NAFs cannot be different, which contradicts the
supposition. Q.E.D.
Lemma 4.2 For a given fault F, the RLF and all the NAFs
will successfully establish a restoration tree rooted at RLF
such that Agent NetReconf can start the reconfiguration
step.
Proof: According to Lemma 4.1, all non-faulty NAFs will
elect the same RLF. Phase 1 and Def. 4.3 assure that a NAF
is reached by RLF and that the restoration path is
established. By sending a Restoration Tree Built (RTB) message,
as described in Phase 1, it is guaranteed that a NAF is
notified that the restoration tree was established. Def. 4.1 and
4.2 assure that this point-to-point message is delivered to
its destination reliably. Finally, Def. 4.4 assures that both
RLF and NAFs receive the routing information describing
the restoration tree. Therefore the restoration tree is reliably
established. Q.E.D.
Lemma 4.3 For a given failure all NAFs, NORTs and RLF
successfully update their routing tables and Agent NetRe-
conf execution terminates.
Proof: Since Lemma 4.2 assures that the restoration tree
is reliably established, then from Phase 3, it is assured that
new routing information is collected by the explorer agents.
Def. 4.4 assures that RLF receives the new information and
updates its table before any NAF. Def. 4.5 guarantees that
the NAFs receive new information after RLF completes its
updates. Phase 3 makes sure that RLF knows that a NAF
finished updating and that it is ready to resume operations.
Q.E.D.
Lemma 4.4 All the explorer agents Eij and ERxi deacti-
vate.
Proof: By Def. 4.4, an Eij explorer returns home after
the restoration tree RTF has been established. Phase 3 as-
sures that Eij deactivates after the RLF updates its routing
information. Similarly, Def. 4.5 assures that ERxi returns
home and deactivates after the NAF updates its table. In ad-
dition, Phase 3 assures that the Eij that were created and
never reach a NAF will deactivate. Q.E.D.
Lemma 4.5 In the presence of multiple intersecting
restoration trees, none of the intersecting RLs will remain
forever in Phase 2.
Proof: The goal of Phase 2 is to ensure that at any given
time only RLs with non-intersecting restoration trees will
be executing Phase 3, in which the routing information is
updated. In the cases of consecutive failures and
simultaneous disjoint failures, this is always true, so Phase 2 is
skipped and the RLs will proceed to Phase 3 independently
from each other. If there are simultaneous failures with
intersecting restoration trees, then their RLs must establish
a temporal order, which results in a sequence of temporally
disjoint reconfigurations around single failures or
simultaneous disjoint failures.
For each two intersecting restoration trees there is at
least one joint node, which detects the intersection. This
guarantees that at least one of the RLs in each intersection
will be notified about it. The temporal order is established
by the intersecting RLs based on their node IDs - nodes
with higher IDs have higher priority. All lower priority RLs
will wait in Phase 2 until all higher priority RLs have com-
pleted Phase 3. Following the algorithm, after completing
Phase 3, each RL notifies all lower-priority RLs, which al-
lows the next leader in the temporal order to execute Phase
3. Thus, all leaders that were waiting in Phase 2 will even-
tually receive the required synchronization messages that
allow them to proceed to Phase 3. Q.E.D.
Theorem 4.2 On all nodes Agent NetReconf will success-
fully complete in the presence of multiple failures, i.e. Agent
NetReconf will terminate and the nodes adjacent to the fail-
ures will be reachable.
Proof: Based on Lemmas 4.1 - 4.5, it can be concluded
that the RLF and the NAFs will proceed with all phases
of Agent NetReconf and will generate the required explorer
agents to carry out the establishment of the restoration tree
and the reconfiguration of each node (RLF , NAFs and
NORTs) on the tree. Q.E.D.
4.3. Liveliness
This section proves that on completion of Agent
NetReconf the network will be reconfigured appropriately.
Theorem 4.3 On completion of Agent NetReconf, all con-
nected nodes in the network are reachable.
Proof: The appearance of a failure causes all the paths
that go through the faulty link or node to be bisected. The
results are segments of unreachable nodes where each seg-
ment begins with a NAF. By Assumption 3.2, the network is
not partitioned, such that all connected nodes are reachable
through non-faulty physical paths. Lemma 4.2 assures that
all the NAFs and NORTs are reachable through a spanning
tree rooted at the NAF acting as restoration leader. During
the recovery phase, Lemma 4.3 guarantees that all the nodes
on the restoration tree have their routing tables updated in
a way such that all the faulty segments are replaced with
restoration paths. Theorem 4.2 demonstrates that Agent Ne-
tReconf will terminate for any single failure by executing
a “safe” sequence of reconfigurations that are performed
synchronously and coordinated by the restoration leader.
Q.E.D.
4.4. Safety
The goal of this section is to define and prove the safety
property of Agent NetReconf, namely, avoidance of infinite
loops and cyclic dependencies.
Theorem 4.4 Agent NetReconf does not create infinite
loops or cyclic dependencies.
Proof: Cyclic dependencies among the nodes on the
restoration tree will not be created, because Step 3.1
prevents any explorer agent Eij in search mode from either
returning to the RLF or continuing to explore if the node
it is visiting was already visited by another Eij from RLF.
Lemma 4.5 proves that no restoration leader will be blocked
forever in Phase 2. Likewise, cyclic dependencies between
the RLs cannot arise, because they are resolved by always
giving priority to the nodes with higher ID or to nodes that
are already in Phase 3.
In the presence of multiple failures, the RLs will enter
Phase 3 in the priority order established in Phase 2, i.e.,
at any time only RLs with disjoint restoration trees are
permitted to execute Phase 3 concurrently. Therefore,
cyclic dependencies cannot be formed between the RLs. The
RL-NAF relations are based on a strict request-response
model, so there are no cyclic dependencies between them.
Since all possible faulty NAFs have been isolated from the
restoration tree in Phase 1 and all reconfiguration messages
are reliably delivered, all loops in Phase 3 will terminate
after the corresponding messages are received. Q.E.D.
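The ordering argument in the proof can be made concrete with a small Python sketch. It is a simplified model and not the authors' protocol: integer leader IDs, restoration trees given as plain sets of node names, the notion of discrete "rounds", and the function name `phase3_rounds` are all assumptions introduced for illustration. A leader runs Phase 3 one round after the latest higher-ID leader whose tree overlaps its own, and leaders with disjoint trees may share a round.

```python
def phase3_rounds(trees):
    """Assign each restoration leader a Phase 3 round.

    trees maps a leader ID to the set of nodes on its restoration tree.
    Leaders are processed in descending ID order, so every leader already
    placed has higher priority than the one being placed.
    """
    rounds = {}
    for leader in sorted(trees, reverse=True):
        blocking = [rounds[h] for h in rounds if trees[h] & trees[leader]]
        rounds[leader] = 1 + max(blocking, default=-1)
    return rounds


# Hypothetical leaders and restoration trees.
example_trees = {
    9: {"a", "b"},
    7: {"b", "c"},   # overlaps leader 9 on node b
    4: {"x", "y"},   # disjoint from 9 and 7
    2: {"c", "y"},   # overlaps leaders 7 and 4
}
schedule = phase3_rounds(example_trees)
```

Because each leader waits only for leaders with strictly higher IDs, the waiting relation follows a total order and cannot contain a cycle in this model.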
4.5. Cognitive Properties
Having autonomous mobile agents execute the algorithm
in parallel at each router reduces the required point-to-point
interactions between the restoration leader and the NAFs.
For instance, two agents would only exchange point-to-
point messages when necessary, otherwise they will work
with the knowledge that exists at each node, and the knowl-
edge they acquire from other agents during the construction
of the restoration tree or the reconfiguration phase.
Having agents execute the recovery algorithm keeps the
knowledge of a failure close to where it happened, instead
of spreading the information widely to other elements that
are oblivious to such a fault. Also, with agents, more
intelligent interactions occur between routers.
For example, the manager NMi at RLF knows that the ar-
rival of an ERxi is the confirmation that the NAF is alive
and the path followed by an Eij is the desired restoration
path. Similarly, if an ERxi returns home it is known to the
NAF that the restoration leader has completed updating its
routing information and that it is its turn to do the same.
The lower complexity of Agent NetReconf allows the
algorithm to scale, because it involves only a small number
of links, as was proved in Section 4.1.
In Agent NetReconf, an explorer agent represents more
than one of the message types used in message-based
algorithms such as [1, 5]. Without oversimplifying, an
agent can be considered a smart message that has cognitive
and evolutive capabilities.
These cognitive properties allow the reconfiguration al-
gorithm to execute faster, because the agents are retrieving
the information from the data knowledge base at the router
and do not have to wait for synchronous acknowledgment
from any router. The use of agents in the reconfiguration
algorithm helps reduce the number of message exchanges,
the number of links used in the reconfiguration and allows
an agent to make an optimal selection of the link that leads
to the next node.
5. Examples of Failure Recovery
5.1. Node Failure Recovery
To illustrate the behavior of Agent NetReconf for recov-
ering a node failure, consider that router R fails on the net-
work shown in Figure 2. After a TIamAlive timeout expires,
routers {A, B, C, D, E, H} detect the failure F. Each router
then becomes a Node Adjacent to Failure (NAF) and in par-
allel they start selecting a restoration leader RLF .
Figure 2. Node failure recovery
Phase 0. In D, NMD queries SD1, its knowledge base,
and determines that router H has the highest ID among
the others that are two hops away via link L1. Similarly,
{A, B, C, D} select H as RLF and then become NAFs.
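The Phase 0 decision can be illustrated with a short Python sketch. It is not the authors' implementation: the function name `select_leader`, the representation of the knowledge base answer as a plain set, and the assumption that router IDs compare alphabetically (so that H is the highest ID among the detecting routers, as in the example) are all introduced here for illustration.

```python
def select_leader(self_id, two_hop_ids):
    """Pick the restoration leader as seen from one router adjacent to the failure.

    two_hop_ids holds the IDs that the local knowledge base reports as two
    hops away through the failed router (hypothetical representation).
    The router includes itself, so the highest-ID router selects itself.
    """
    return max(set(two_hop_ids) | {self_id})


# Routers that detected the failure of R in the example of Section 5.1.
detectors = {"A", "B", "C", "D", "E", "H"}

# Each detector decides locally, using only its own two-hop knowledge.
choices = {router: select_leader(router, detectors - {router}) for router in detectors}
```

In this sketch every router applies the same rule to the same candidate set, so all of them arrive at the same leader without exchanging any messages.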
Phase 1. At H, NMH creates three explorer agents,
EH0, EH1 and EH3, one per active neighbor. Each agent
learns the list of NAFs and starts migrating in search of
them. Consider EH3. When it arrives at RE, the explorer
learns that there are two active links, L0 and L3, and one
feasible route via L3. NAFs {C,D} are presumed to be
reachable through L3, and {A,B} will need to be searched
for via L0. This implies that at least two clones are
required. However, since RE is not a NAF, EH3 can continue
searching. As each explorer reaches a NAF, a restoration
explorer is sent to RLF. At RLF, when ERAH, ERBH, ERCH
and ERDH arrive, the restoration tree is considered built,
shown with black lines in Figure 2.
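The construction of the restoration tree can be approximated with a breadth-first search, as in the Python sketch below. This is a sequential stand-in for the asynchronous explorer agents, not the paper's algorithm. The link list is invented for illustration and is not the topology of Figure 2, and the function name `build_restoration_tree` is hypothetical. The rule that a node already visited is not explored again mirrors the visited check described for explorers in search mode.

```python
from collections import deque


def build_restoration_tree(adjacency, leader, nafs, failed):
    """Return, for each NAF, the path a restoration explorer would follow to the leader."""
    parent = {leader: None}
    queue = deque([leader])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            # Skip the failed router and any node an explorer already visited.
            if neighbor == failed or neighbor in parent:
                continue
            parent[neighbor] = node
            queue.append(neighbor)
    paths = {}
    for naf in nafs:
        if naf not in parent:
            raise ValueError(f"NAF {naf} is unreachable; the network is partitioned")
        path = [naf]
        while parent[path[-1]] is not None:
            path.append(parent[path[-1]])
        paths[naf] = path
    return paths


# Hypothetical links (NOT those of Figure 2); R is the failed router.
links = {
    "H": ["E", "G", "R"],
    "E": ["H", "D", "R"],
    "G": ["H", "A"],
    "A": ["G", "B", "R"],
    "B": ["A", "R"],
    "D": ["E", "C", "R"],
    "C": ["D", "R"],
    "R": ["A", "B", "C", "D", "E", "H"],
}
restoration_paths = build_restoration_tree(links, "H", ["A", "B", "C", "D"], failed="R")
```

The tree is considered built once a path exists for every NAF, which corresponds to the arrival of all restoration explorers at the leader.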
Phase 2. Since there are no overlapping restoration trees,
the agents move to the next phase.
Phase 3. Each ERxi sends a point-to-point RTB message
back home to make each Eij return to RLF. On its way back,
each Eij learns routing information that it later shares
with RLF.
Table 1. Router D, original table
Dest  Port    Dest  Port
A     1       F     0
B     1       G     1
C     2       H     1
D     -       R     1
E     0

Table 2. Router D, updated table
Dest  Port    Dest  Port
A     0       F     0
B     0       G     0
C     2       H     1
D     -       R     1
E     0
When all Eij have arrived, RLF determines the destina-
tions that can be reached through its active links and gives
to each ERxi a list from which it excludes the destina-
tions reachable through the port on which ERxi came in.
ERDH, for example, will be provided with {A, B, F, G}.
On its way home, each node visited by ERDH provides the
destinations reachable through links belonging to RTF ex-
cluding those reachable through the links on which ERDH
arrived at and departed from the node. When ERDH gets
home, it asks NMD to update its routing tables with the
information that it is carrying. After NMD finishes updating
its table, it sends a point-to-point UCR confirmation to
RLF. The table for router D after the reconfiguration is
complete is shown in Table 2.
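The table update at router D can be reproduced with a few lines of Python. This is a sketch under assumptions: the function name `apply_restoration_update` is hypothetical, the entry for D itself is modeled as `None`, and the arrival port 0 is inferred from the difference between Table 1 and Table 2 rather than stated in the text.

```python
def apply_restoration_update(table, carried_destinations, arrival_port):
    """Return a copy of the routing table with the carried destinations rerouted.

    Every destination carried home by the restoration explorer is pointed at
    the port on which the explorer arrived (assumed behavior).
    """
    updated = dict(table)
    for dest in carried_destinations:
        updated[dest] = arrival_port
    return updated


# Table 1: router D before the reconfiguration (None marks D itself).
table_1 = {"A": 1, "B": 1, "C": 2, "D": None, "E": 0,
           "F": 0, "G": 1, "H": 1, "R": 1}

# ERDH carries {A, B, F, G}; port 0 is an assumption inferred from Table 2.
table_2 = apply_restoration_update(table_1, {"A", "B", "F", "G"}, arrival_port=0)
```

Only the entries for A, B and G change, since F was already routed through port 0.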
5.2. Link Failure Recovery
The following example illustrates the behavior of Agent
NetReconf when recovering from a link failure. Assume that
the link connecting routers J and K in Figure 3 fails. After
the TIamAlive timeout expires, routers J and K start the
leader selection phase, and each router assumes that its
neighbor at the other end of the link has failed.
Phase 0. During leader selection, router J is selected
restoration leader RLJ by routers {A, C, D}. Likewise,
router K is selected restoration leader RLK by routers
{E, G, H, I}.

Figure 3. Link failure recovery
Phase 1. At J, four explorer agents are created:
EJ1, EJ2, EJ3 and EJ4. At K, three explorer agents
are created: EK0, EK1 and EK3. To start building the
restoration paths, the explorers from each leader start
migrating in search of the NAFs known to that leader. In
the search process, explorer agents EK3 and EJ4 arrive
at restoration leaders RLJ and RLK respectively. With the
arrival of the explorers, both leaders realize that the
router they presumed failed is in fact alive. Both leaders
mark the link that connected them as faulty and then
determine the new role of the supposedly faulty node in
this phase. Router J determines that router K’s ID is higher
and becomes a NAF belonging to RLK. Router J then issues a
point-to-point deactivate message to all its explorers to
indicate that it is no longer the leader; see the pseudo-code
in Appendix A. After J assumes its new role, Phase 1
continues as described in Section 3.1. Note that EK3 stays
at J, since J has become a NAF.
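The role change described above can be modeled as a small state update. The Python sketch below is illustrative only and is not the pseudo-code of Appendix A: the class `RestorationLeader`, its field names, the alphabetical comparison of router IDs, and the modeling of the deactivate message as clearing a list are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RestorationLeader:
    router_id: str
    presumed_failed: str
    role: str = "RL"
    leader: Optional[str] = None
    faulty_link_to: Optional[str] = None
    explorers: List[str] = field(default_factory=list)

    def on_explorer_arrival(self, origin_id: str) -> None:
        """React to an explorer sent by the router that was presumed failed."""
        if origin_id != self.presumed_failed:
            return
        # The neighbor is alive, so only the connecting link is faulty.
        self.faulty_link_to = origin_id
        if origin_id > self.router_id:
            # The other leader has the higher ID: step down and become its NAF.
            self.role = "NAF"
            self.leader = origin_id
            self.explorers.clear()  # stands in for the deactivate message


j = RestorationLeader("J", presumed_failed="K",
                      explorers=["EJ1", "EJ2", "EJ3", "EJ4"])
k = RestorationLeader("K", presumed_failed="J",
                      explorers=["EK0", "EK1", "EK3"])

j.on_explorer_arrival("K")  # EK3 reaches J
k.on_explorer_arrival("J")  # EJ4 reaches K
```

After both arrivals J is a NAF of K with no active explorers, while K remains the only restoration leader.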
Phase 2. Since there are no overlapping restoration trees,
the agents move to the next phase.
Phase 3. Each ERxi sends a point-to-point RTB message
back home to make each EKj return to RLK. On its way back,
each EKj learns routing information that it later shares
with RLK. Phase 3 continues to the end as described in
Section 3.1. The table for router J after the reconfiguration
is complete is shown in Table 4.
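The change between Tables 3 and 4 amounts to moving every destination that used the failed port onto a restoration port. The Python sketch below reproduces it under assumptions: the function name `reroute_failed_port` is hypothetical, and the reading that port 0 at J is the failed link to K and port 4 is the restoration port is inferred from the two tables, not stated in the text.

```python
def reroute_failed_port(table, failed_port, restoration_port):
    """Return a copy of the table with the failed port replaced by the restoration port."""
    return {dest: (restoration_port if port == failed_port else port)
            for dest, port in table.items()}


# Table 3: router J before the reconfiguration.
table_3 = {"A": 0, "B": 2, "C": 0, "D": 0, "E": 1, "F": 3, "G": 4,
           "H": 2, "I": 3, "SA": 0, "SB": 2, "SD": 0, "SE": 1, "SF": 3}

# Assumed: port 0 is the failed J-K link and port 4 leads to the restoration path.
table_4 = reroute_failed_port(table_3, failed_port=0, restoration_port=4)
```

Entries that never used the failed port are left untouched, which matches the unchanged rows of Table 4.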
Table 3. Router J, original table
Dest  Port    Dest  Port    Dest  Port
A     0       F     3       SB    2
B     2       G     4       SD    0
C     0       H     2       SE    1
D     0       I     3       SF    3
E     1       SA    0

Table 4. Router J, updated table
Dest  Port    Dest  Port    Dest  Port
A     4       F     3       SB    2
B     2       G     4       SD    4
C     4       H     2       SE    1
D     4       I     3       SF    3
E     1       SA    4

6. Conclusions
This paper has presented Agent NetReconf, a dynamic
network reconfiguration algorithm that uses collaborative
agents. It was proved by complexity analysis that Agent
NetReconf is significantly more efficient than message-based
algorithms [1, 5], and reduces by more than one order
of magnitude the number of interactions and message
exchanges required to perform the network reconfiguration, as
was explained in Section 4.1.
The improvement in complexity achieved in Agent
NetReconf is based on the fact that all the agent interactions
occur at each router and the number of point-to-point
non-in-router communications is minimal.
Another important, but not obvious, contributor to Agent
NetReconf’s reduction in complexity is the representation
of agent knowledge as an OWL ontology. Using OWL
dramatically simplifies the way in which agents exchange
information. For example, during Leader Selection an
agent only has to query the router’s knowledge base for the
neighbor with the highest ID that is two hops away. Querying
the OWL knowledge base is executed in constant time and does
not require any agents to be created, so its contribution to
the communication complexity is zero. This is mainly because
the queries are executed locally and never leave the
current router. This last property assures that neither the
agents nor Agent NetReconf need to use any global
network information.
The combination of the agent-based architecture and
Agent NetReconf represents an important contribution to
active networking, because the network takes control of all
its tasks and uses intelligence to provide improved
reliability and quality routing.
The cognitive properties of the agents allow the reconfig-
uration algorithm to execute faster, because the agents are
retrieving the information from the data knowledge base at
the router and do not have to wait for synchronous acknowl-
edgment from any other router. This facilitates the optimal
selection of the link that leads to the next node during the
reconfiguration.
To conclude, Agent NetReconf is a low complexity, in-
telligent distributed dynamic network reconfiguration algo-
rithm that is applicable to network computers with arbitrary
topologies, is application-transparent and is capable of iso-
lating and tolerating multiple faulty links or nodes.

References
[1] D. Avresky and N. Natchev. Dynamic Reconfiguration in
Computer Clusters with Irregular Topologies in the Presence
of Multiple Node and Link Failures. IEEE Transactions on
Computers, 55(2), May 2005.
[2] N. Bennacer, Y. Bourda, and B. Doan. Formalizing for
Querying Learning Objects Using OWL. In Proceedings of
IEEE International Conference on Advanced Learning Tech-
nologies, pages 321–325, 2004.
[3] G. D. Caro and M. Dorigo. Mobile Agents for Adaptive
Routing. In Proceedings of 31st International Conference
on System Sciences (HICSS-31), 1998.
[4] H. Chalupsky, T. Finin, R. Fritzson, D. McKay, S. Shapiro,
and G. Weiderhold. An Overview of KQML: A Knowl-
edge Query and Manipulation Language. Technical report,
KQML Advisory Group, Apr. 1992.
[5] J. Duato, R. Casado, A. Bermúdez, and F. J. Quiles. A Pro-
tocol for Deadlock-Free Dynamic Reconfiguration in High-
Speed Local Area Networks. IEEE Transactions on Parallel
and Distributed Systems, 12(2):115 – 132, February 2001.
[6] M. Garijo, A. Cancer, and J. Sanchez. A Multi-Agent Sys-
tem for Cooperative Network-Fault Management. In Pro-
ceedings of the First International Conference and Exhibi-
tion on the Practical Applications of Intelligent Agents and
Multi-agent Technology, pages 279 – 294, 1996.
[7] M. Heusse, S. Guérin, D. Snyers, and P. Kuntz. Adaptive
Agent-Driven Routing and Load Balancing in Communication
Networks. Complex Systems, 1998.
[8] C. S. Hood and C. Ji. Intelligent Agents for Proactive
Fault Detection. IEEE Internet Computing, 2(2):65–72,
March – April 1998.
[9] N. Minar, K. H. Kramer, and P. Maes. Cooperating Mobile
Agents for Mapping Networks. In Proceedings of the First
Hungarian National Conference on Agent Based Computa-
tion, 1999.
[10] H. S. Nwana. Software Agents: An Overview. Knowledge
Engineering Review, 11(3):205–244, Oct./Nov. 1995.
[11] R. Schoonderwoerd, O. E. Holland, J. L. Bruten, and L. J. M.
Rothkrantz. Ant-Based Load Balancing in Telecommunica-
tions Networks. Adaptive Behavior, 5(2):169–207, 1996.
[12] D. L. Tennenhouse, J. M. Smith, W. D. Sincoskie, D. J.
Wetherall, and G. J. Minden. A Survey of Active Network
Research. IEEE Communications Magazine, 35(1):80–86,
1997.
[13] S. Wang, D. Xuan, R. Bettati, and W. Zhao. A Study of
Providing Statistical QoS in a Differentiated Services Network.
In NCA’03, Proceedings of IEEE International Symposium
on Network Computing and Applications, pages 297–304,
2003.
[14] G. Weiss. Multi Agent Systems, A Modern Approach to Dis-
tributed Artificial Intelligence. MIT Press, 2001. ISBN:
0-262-23203-0.
[15] T. White, A. Bieszczad, and B. Pagurek. Distributed Fault
Location in Networks Using Mobile Agents. In IATA
1998, Proceedings of the Second International Workshop
on Intelligent Agents for Telecommunication, volume 1437,
1998.
[16] M. J. Wooldridge. The Logical Modeling of Computational
Multi-Agent Systems. PhD thesis, University of Manchester,
1992.
[17] M. J. Wooldridge and N. R. Jennings. Intelligent Agents:
Theory and Practice. Knowledge Engineering Review,
10(2):115–152, June 1995.
[18] Y. Yemini and S. daSilva. Towards programmable networks.
In Proceedings of IFIP/IEEE International Workshop on
Distributed Systems: Operations and Management, 1996.
[19] P. Zhang and Y. Sun. A New Approach Based on Mobile
Agents to Network Fault Detection. In ICCNMC’01, Pro-
ceedings of the International Conference on Computer Net-
works and Mobile Computing, 2001.
