TalonStore: A Multi-Master Replication Strategy
Vishant Bhole
Illinois Institute of Technology
Chicago, US
vbhole@hawk.iit.edu
Saptarshi Chatterjee
Illinois Institute of Technology
Chicago, US
schatterjee@hawk.iit.edu
Abstract –
Data replication is the process of storing data in more than one site or node. It improves the availability of data: the result is a distributed database in which users can access data relevant to their tasks without interfering with the work of others. Data replication is generally performed to provide a consistent copy of data across all the database nodes. Traditionally this is done by copying data from one database server to another, so that all the servers hold the same data. Our implementation proposes a completely different approach. Instead of copying data from one node to another, in our design the master replicas do not communicate directly with each other and work virtually independently for write queries. For read queries, an independent process consults all the replicas to constitute a quorum and returns the result if a majority of the machines in the system respond with the same result.
Keywords: Multi-Master Replication, Distributed Systems, Big Data, Kafka, Dynamo, RDBMS, NoSQL.
I. INTRODUCTION
Synchronous multi-master replication ensures that we can write to any node and be sure that the write will be consistent across all nodes in the cluster at any given point in time: any write is either committed on all nodes or not committed at all.
Multi-master replication, also known as advanced replication or symmetric replication, allows us to maintain multiple sets of identical data at various sites. In a multi-master setup, redirecting read queries to any of the masters should yield the same output. Other benefits of multi-master replication are:
● Load balancing - incoming network traffic is spread effectively across a group of servers.
● Fault tolerance - the system continues operating properly if some servers fail.
● Increased data locality and availability - any client requesting data gets it back from the nearest server.
Multi-master replication differs from master-slave replication, in which a single member of the group is elected as the "master" for a portion of the data and is the only node allowed to modify it. Other members wishing to modify the data must contact the master node. Allowing only a single master makes it easier to achieve consistency among the members of the group, but is less flexible than multi-master replication.
Synchronous multi-master replication can likewise be contrasted with asynchronous replication, where passive slave servers duplicate the master's data in order to be ready to take over if the master stops working. Asynchronous replication supports eventual consistency, whereas with synchronous multi-master replication we can implement a strictly consistent system and distributed transactions.
In this article we survey related work in this field, how existing systems implement multi-master replication, and their benefits and limitations. Based on this research we propose a new approach in which master replicas do not acquire locks on other replicas and work in parallel for write queries. For read queries, a separate process consults all the replicas to constitute a quorum.
Fig. 1 Multi-master replication system [8]
II. RELATED WORK
We studied several similar architectures among existing technologies: Oracle's RDBMS implementation, NoSQL databases such as MongoDB, HDFS, Kafka's replication strategy, Cassandra, and Amazon's Dynamo. Our findings are described below.
A. Oracle’s Implementation of Multi-Master Replication
Oracle supports two types of multi-master replication:
1. Asynchronous replication: captures any local changes, stores them in a queue, and, at regular intervals, propagates and applies these changes at remote sites. With this form of replication there is a period of time before all sites achieve data convergence.
2. Synchronous replication: Applies any changes to all sites
participating in the replication environment as part of a
single transaction. If the propagation fails at any of the
master sites, then the entire transaction, including the initial
change at the local master site, rolls back. Ensures data
consistency across the replication environment. There is
never a period of time when the data at any of the master
sites does not match. Hence strict consistency is enforced.
Oracle first locks the local row and then uses an AFTER ROW trigger to lock the corresponding remote row. Oracle releases the locks when the transaction commits at each site. This mechanism supports distributed transactions.
Fig. 2 Oracle’s implementation of Synchronous replication
There are a few limitations to this approach:
● Distributed transactions are complex. Local changes must roll back if any participating system fails.
● If one of the participating nodes goes down or responds slowly, the entire system cannot accept write queries, so such a system is very fragile.
● It cannot handle Byzantine faults, as nodes answer read queries directly without constituting a quorum.
B. MongoDB Data Replication
We could not find any strong evidence that MongoDB supports synchronous multi-master replication. A distributed MongoDB cluster consists of a group of mongod instances (a replica set) that maintain the same data set. A replica set consists of several data-bearing nodes and, optionally, one arbiter node. Among the data-bearing nodes, one and only one member is deemed the primary node, while the others are secondary nodes.
The primary node receives all write operations[11]. The secondaries replicate the primary’s oplog and apply the operations to their own data sets. If the primary becomes unavailable, an eligible secondary holds an election to elect itself the new primary. A rollback reverts write operations on a former primary when the member rejoins its replica set after a failover, if the primary had accepted write operations that the secondaries had not successfully replicated before the primary stepped down.
Fig. 3 MongoDB Replication
C. Kafka Replication
Every topic partition in Kafka is replicated n times (a configurable replication factor). This allows Kafka to automatically fail over to these replicas when a server in the cluster fails, so that messages remain available in the presence of failures. Replication in Kafka happens at the partition granularity, where the partition’s write-ahead log is replicated in order to n servers. Of the n replicas, one is designated the leader while the others are followers. The leader takes writes from the producer and the followers merely copy the leader’s log in order.
The leader for every partition tracks its in-sync replica (ISR) list by computing the lag of every replica from itself. When a producer sends a message to the broker, it is written by the leader and replicated to all the partition’s replicas. A message is committed only after it has been successfully copied to all the in-sync replicas.
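The ISR bookkeeping described above can be sketched as follows. This is a toy model under our own naming, not Kafka's actual code; the lag threshold is an assumed parameter.

```java
import java.util.HashMap;
import java.util.Map;

// Toy in-sync-replica (ISR) tracking: the leader knows its own log end
// offset and each follower's fetched offset; a follower whose lag exceeds
// maxLag drops out of the ISR. A message is "committed" once every
// in-sync replica has replicated past its offset.
public class IsrTracker {
    final long maxLag;
    long leaderEndOffset = 0;
    final Map<String, Long> followerOffsets = new HashMap<>();

    IsrTracker(long maxLag) { this.maxLag = maxLag; }

    void append() { leaderEndOffset++; }                   // leader takes a write
    void fetch(String follower, long offset) {             // follower catches up
        followerOffsets.put(follower, offset);
    }

    boolean inSync(String follower) {
        return leaderEndOffset - followerOffsets.getOrDefault(follower, 0L) <= maxLag;
    }

    // Highest offset replicated by every in-sync replica = the commit point.
    long committedOffset() {
        long committed = leaderEndOffset;
        for (Map.Entry<String, Long> e : followerOffsets.entrySet())
            if (inSync(e.getKey())) committed = Math.min(committed, e.getValue());
        return committed;
    }
}
```

A lagging follower is simply dropped from the ISR, so a slow replica stops gating commits instead of stalling the partition.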
Fig. 4 Kafka replication [7]
D. HDFS Replication
HDFS replication enables replication of HDFS data from one HDFS service to another, synchronizing the data set on the destination service with the data set on the source service. While performing a replication we need to ensure that the source directory is not modified: a file added during replication does not get replicated, and if we delete a file during replication, the replication fails.
HDFS is not optimized for incremental writes or appends; it suits a write-once, read-many-times model. It stores each file as a sequence of blocks, which are placed on different DataNodes[12]. The NameNode keeps track of the blocks on each DataNode. The default replication factor is 3.
Fig. 5 Block Replication[12]
Each HDFS block is constructed through a write pipeline. Bytes are pushed into the pipeline packet by packet. There are effectively three stages in HDFS replication:
Stage 1 - The pipeline is set up. A Write Block request is sent by the client downstream along the pipeline. After the last DataNode receives the request, an ack is sent upstream along the pipeline back to the client.
Stage 2 - User data is first buffered at the client side. After a packet is filled, the data is pushed into the pipeline. This is the data streaming stage.
Stage 3 - The client sends a close request only after all packets have been acknowledged. When the block replication is finalized, the pipeline is shut down.
Fig. 6 Block Construction Pipeline
E. Amazon’s Dynamo
Amazon’s Dynamo is a highly available key-value storage system. It supports primary-key access to data, which is useful for services like session management. Dynamo targets services that need a highly available system that always accepts write queries; this requirement pushes the complexity of conflict resolution onto data readers, and writes are never rejected. Dynamo combines many core distributed-systems techniques to solve problems at Amazon scale, focusing on data versioning, partitioning, and replication.
Data Versioning
Because writes are propagated to replicas asynchronously, Dynamo provides eventual consistency: a get operation may return an object that has not yet been updated to the latest version. The result of each modification is treated as a new and immutable version of the data, identified by a vector clock that is incremented on every update. A client updating an object specifies which version it is updating, and a client reading the object is responsible for merging divergent versions as needed. Dynamo also provides a background process that automatically merges versions of data that have no conflicts.
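The version comparison Dynamo relies on can be illustrated with a minimal vector clock. This is a sketch under our own naming, not Dynamo's API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal vector clock: each node increments its own counter on a write.
// Clock a descends from clock b when a >= b component-wise; if neither
// descends from the other, the versions are concurrent and the client
// must reconcile them.
public class VectorClock {
    final Map<String, Integer> counters = new HashMap<>();

    void increment(String node) { counters.merge(node, 1, Integer::sum); }

    boolean descendsFrom(VectorClock other) {
        for (Map.Entry<String, Integer> e : other.counters.entrySet())
            if (counters.getOrDefault(e.getKey(), 0) < e.getValue()) return false;
        return true;
    }

    boolean concurrentWith(VectorClock other) {
        return !descendsFrom(other) && !other.descendsFrom(this);
    }

    // Syntactic merge: component-wise maximum of the two clocks.
    VectorClock merge(VectorClock other) {
        VectorClock m = new VectorClock();
        Set<String> nodes = new HashSet<>(counters.keySet());
        nodes.addAll(other.counters.keySet());
        for (String n : nodes)
            m.counters.put(n, Math.max(counters.getOrDefault(n, 0),
                                       other.counters.getOrDefault(n, 0)));
        return m;
    }
}
```

Two clocks that are concurrent correspond exactly to the divergent versions the Dynamo client must merge on read.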
Partitioning
Dynamo allows the system to scale incrementally by adding nodes, which requires the system to dynamically partition data over the set of nodes. To achieve this, Dynamo uses consistent hashing to assign each data item to a node. The nodes are arranged in a ring, where the largest hash value wraps around to the smallest. The arrival or departure of a node then affects only that node’s immediate neighbours on the ring.
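The ring placement can be sketched with a sorted map. This is a toy version under assumed names; real Dynamo additionally uses virtual nodes for load balance.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Consistent hashing on a ring: a key maps to the first node clockwise
// from the key's hash, wrapping around from the largest hash position
// back to the smallest.
public class HashRing {
    final SortedMap<Integer, String> ring = new TreeMap<>();

    // A non-negative hash; String.hashCode is used only for illustration.
    static int hash(String s) { return s.hashCode() & 0x7fffffff; }

    void addNode(String node)    { ring.put(hash(node), node); }
    void removeNode(String node) { ring.remove(hash(node)); }

    String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        int pos = tail.isEmpty() ? ring.firstKey() : tail.firstKey(); // wrap around
        return ring.get(pos);
    }
}
```

Adding or removing a node moves only the keys on the arc adjacent to it, which is why membership changes affect only a node's immediate neighbours.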
Fig. 7 Dynamo Ring Arrangement[3]
Replication
Data is replicated on many hosts to provide durability and high data availability in Dynamo. Every data item is replicated at n hosts: each key is assigned a coordinator node, which is in charge of replicating the data to n-1 neighbouring hosts on the ring.
F. Data Replication in Cassandra
Cassandra’s design is influenced by Amazon’s Dynamo paper, published in 2007. It divides a hash ring into several chunks and keeps N replicas of each chunk on different nodes. Developers can tune quorums and active anti-entropy to keep replicas up to date[4].
Cassandra uses replication to achieve high availability and durability. Each data item is replicated at N hosts (a configurable replication factor). Each key, k, is assigned to a coordinator node, which takes care of the replication: the coordinator stores the data locally and also replicates it to N-1 other nodes. Cassandra provides configurable replication policies such as “Rack Unaware”, “Rack Aware” (within a datacenter) and “Datacenter Aware”. Cassandra elects a leader amongst its nodes using a system called ZooKeeper. A node joining the cluster contacts the leader, who in turn tells it which data ranges it is a replica for. The metadata about the ranges a node is responsible for is stored locally at each node and inside ZooKeeper, so when a node crashes and comes back up it knows which ranges it was responsible for. All nodes are aware of every other node in the system and hence of the ranges each is responsible for.
Fig. 8 Cassandra Data Flow
There are three types of read requests that a coordinator sends to replicas:
a) Direct request
b) Digest request
c) Read repair request
● The coordinator sends a direct request to one of the replicas.
● The coordinator then sends a digest request to the number of replicas specified by the consistency level and checks whether the returned data is up to date.
● Finally, the coordinator sends a digest request to all the remaining replicas. If any node returns an out-of-date value, a background read repair request updates that data. This process is called the read repair mechanism.
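The digest comparison at the heart of read repair can be sketched as follows. All names here are hypothetical, and where real Cassandra computes MD5 digests over row data, we use a plain hash for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Toy read path: one replica returns the full value, the others are
// checked by digest (hash). A digest mismatch flags a replica as out of
// date, and a background "read repair" pushes the fresh value to it.
public class ReadRepair {
    final Map<String, Map<String, String>> replicas = new HashMap<>();

    void write(String replica, String key, String value) {
        replicas.computeIfAbsent(replica, r -> new HashMap<>()).put(key, value);
    }

    static int digest(String value) { return value == null ? 0 : value.hashCode(); }

    // Coordinator: take the full value from the 'direct' replica, compare
    // digests from the rest, and repair any replica whose digest disagrees.
    String read(String direct, String key) {
        String value = replicas.getOrDefault(direct, Map.of()).get(key);
        for (String r : replicas.keySet()) {
            if (r.equals(direct)) continue;
            if (digest(replicas.get(r).get(key)) != digest(value))
                write(r, key, value);        // background read-repair request
        }
        return value;
    }
}
```

Digests keep the extra replica traffic small: only a mismatch triggers transfer of the full value.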
V. OUR PROPOSED SYSTEM DESIGN
Our implementation, inspired by the Dynamo system[3], proposes a slightly different approach: we have separated the data nodes from the event bus and the coordinator system.
We reviewed several scalable system design patterns[2] and found the event-based architecture model most suitable for this problem. Event-based architecture supports several communication styles:
● Publish-subscribe
● Broadcast
● Point-to-point
The publish-subscribe style decouples sender and receiver and facilitates asynchronous communication. Event-driven architecture (EDA) promotes the production, detection, consumption of, and reaction to events[10]. The main advantage of this architecture is that its components are loosely coupled.
Fig. 9 Event Driven Architectural Model
A. The client sends a write query to a write coordinator system. The coordinator pushes the data into a queue under a particular topic.
@RequestMapping(method = RequestMethod.POST,
        consumes = {"application/x-www-form-urlencoded"},
        value = "/api/savedata")
public String saveData(@RequestParam Map<String, String> savequery) {
    String key = savequery.get("key");
    String value = savequery.get("value");
    kafkaTemplate.send("savedata", dataFormattor(key, value));
    return "Data successfully Saved";
}
B. All the participating systems subscribe to that topic and listen for changes. Each subscriber belongs to a different group, so they process write queries in parallel and independently of each other, saving the changes to the local file system. At this point we can send a write acknowledgement back to the client as soon as the data is written to at least one node, or wait for acks from all the nodes, thus enforcing strong consistency.
@KafkaListener(topics = "savedata", groupId = "${diskpath.property}")
public void saveDataToDisk(String message) throws IOException {
    String[] data = dataDeserializer(message);
    File file = new File(diskpath + "/" + data[0]);
    file.getParentFile().mkdirs();
    FileWriter writer = new FileWriter(file);
    try {
        writer.write(data[1]);
    } catch (Exception e) {
        logger.info(e.getMessage());
    } finally {
        writer.close();
    }
}
C. When a read query reaches a participating node, instead of answering directly it constitutes a quorum: it waits for all the participating nodes to catch up and returns the result to the client only if all the nodes agree on the same data, thus enforcing strong consistency. Alternatively, we can send back a reply once a majority of the nodes agree on the result, allowing the system to detect and mask a limited number of Byzantine faults.
@KafkaListener(topics = "retrievedata", groupId = "coordinator")
public void retrieveValue(String message) throws FileNotFoundException, IOException {
    String[] data = dataDeserializer(message);
    // Read data from all the participating nodes
    BufferedInputStream reader1 = new BufferedInputStream(
            new FileInputStream(diskpath + "/" + data[1]));
    BufferedInputStream reader2 = new BufferedInputStream(
            new FileInputStream(machine1 + "/" + data[1]));
    BufferedInputStream reader3 = new BufferedInputStream(
            new FileInputStream(machine2 + "/" + data[1]));
    boolean running = true;
    while (running) {
        // Wait for all the nodes to catch up.
        if (reader1.available() > 0 && reader2.available() > 0
                && reader3.available() > 0) {
            String val1 = IOUtils.toString(reader1, "UTF-8");
            String val2 = IOUtils.toString(reader2, "UTF-8");
            String val3 = IOUtils.toString(reader3, "UTF-8");
            // Constitute a quorum: all 3 nodes must match (strong consistency).
            if (val1.equals(val2) && val2.equals(val3)) {
                webSocket.convertAndSend("/topic/backToClient/" + data[0], val1);
            } else {
                webSocket.convertAndSend("/topic/backToClient/" + data[0],
                        "Nodes give different Data");
            }
            running = false;
        } else {
            try {
                Thread.sleep(150);
            } catch (InterruptedException ex) {
                running = false;
            }
        }
    }
}
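The majority variant (reply once most replicas agree, masking one faulty answer out of three instead of failing the read) can be factored into a small helper. This is our sketch, not part of the implementation above:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Majority quorum: return the value reported by more than half of the
// replicas, or null when no value reaches a majority. With 3 replicas
// this masks one faulty (Byzantine) answer instead of rejecting the read.
public class MajorityQuorum {
    static String majorityValue(List<String> answers) {
        Map<String, Integer> votes = new HashMap<>();
        for (String a : answers) votes.merge(a, 1, Integer::sum);
        for (Map.Entry<String, Integer> e : votes.entrySet())
            if (e.getValue() * 2 > answers.size()) return e.getKey();
        return null;  // no majority: surface the conflict to the client
    }
}
```

In the read path above, val1, val2, and val3 would be passed to majorityValue in place of the strict triple-equality check.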
In essence, the proposed architecture looks as follows.
Fig. 10 Talon Store Architecture.
VI. RESULTS AND DISCUSSION
We tested our system with multiple parallel client requests and a dual-partition queue topic with 2 brokers, and recorded the following metrics on macOS Mojave (MacBook Pro 2017, 2.3 GHz Intel Core i5, 8 GB 2133 MHz LPDDR3).
Our results show that this architecture performs marginally faster than Cassandra under similar load, but much of this data may be influenced by the fact that we tested on a single standalone machine instead of an actual distributed network of systems.
However, the proposed design is much more decoupled, allowing each block of the system to be tweaked and configured separately.
Write query performance
Read query performance
start.time, end.time - experiment start and end time.
fetch.size - amount of data to fetch in a single request.
data.consumed.in.MB - size of all messages consumed.
MB.sec - data transferred in MB per second (throughput by size).
data.consumed.in.nMsg - total count of messages consumed during this test.
nMsg.sec - messages consumed per second (throughput by message count).
VII. FUTURE WORK
In this paper we only discussed how to detect Byzantine faults, not how to correct them. However, this can be addressed the way Cassandra does: once the read coordinator detects a faulty node, it can send a data repair request to that node, which in turn fixes the corrupt data using a gossip protocol with the other participating nodes.
Also, for this research we implemented a web-based client, and a web-based client cannot open a socket connection to a server running on a different domain due to CORS restrictions, so we had to serve read queries from the REST controller that receives the query. This limitation can easily be bypassed with a non-web-based client.
VIII. APPENDIX
Project Final Demo Link
https://www.youtube.com/watch?v=0jBl7rOrQiU
Source code - https://github.com/sap9433/TalonSystems
IX. REFERENCES
[1] Shvachko, K., Kuang, H., Radia, S. and Chansler, R.
(2018). The Hadoop Distributed File System. [online]
Storageconference.us. Available at: http://storageconference
.us/2010/Papers/MSST/Shvachko.pdf [Accessed 28 Nov.
2018].
[2] Kreps, J., Narkhede, N. and Rao, J. (2018). Kafka: a
Distributed Messaging System for Log Processing. [online]
Notes.stephenholiday.com. Available at: http://notes.stephen
holiday.com/Kafka.pdf [Accessed 28 Nov. 2018].
[3] DeCandia, G., Sivasubramanian, S., Lakshman, A. and
Hastorun, D. (2018). Dynamo: Amazon’s Highly Available
Key-value Store. [online] Courses.cse.tamu.edu. Available
at: http://courses.cse.tamu.edu/caverlee/csce438/readings/dy
namo-paper.pdf [Accessed 28 Nov. 2018].
[4] Lakshman, A. and Malik, P. (2018). Cassandra - A
Decentralized Structured Storage System. cs.cornell.edu.
[online] Available at: https://www.cs.cornell.edu/projects/
ladis2009/papers/lakshman-ladis2009.pdf [Accessed 28
Nov. 2018].
[5] "Multi-Master Replication." Percona XtraDB Cluster Documentation. Percona Live - Open Source Database Conference 2018. Accessed November 26, 2018. https://www.percona.com/doc/percona-xtradb-cluster/LATEST/features/multimaster-replication.html.
[6] "Database Advanced Replication." Master Replication
Concepts and Architecture. August 01, 2008. Accessed
November 26, 2018. https://docs.oracle.com/cd/B28359
_01/server.111/b28326/repmaster.htm#sthref144.
[7] Narkhede, N. (2018). Hands-free Kafka Replication: A
lesson in operational simplicity. confluent. [online]
Available at: https://www.confluent.io/blog/hands-free-
kafka-replication-a-lesson-in-operational-simplicity/[Access
ed 28 Nov. 2018].
[8] Fabio Erculiani. "Google/mysql-tools." GitHub.
Accessed December 06, 2018. https://github.com/
google/mysql-tools/wiki/Semi-Sync-Replication-Design.
[9]"Database Advanced Replication." Master Replication
Concepts and Architecture. August 01, 2008. Accessed
November 26, 2018. https://docs.oracle.com/
cd/B28359_01/server.111/b28326/repmaster.htm#sthref144
[10] Dr. Tong Lai Yu. "Distributed Systems Architecture."
Distributed Systems Architecture. Accessed December 06,
2018.
http://cse.csusb.edu/tongyu/courses/cs660/notes/distarch.php
[11] "Replication." In-Memory Storage Engine - MongoDB
Manual. Accessed December 06, 2018. https://docs.mongo
db.com/manual/replication/.
[12]Bakshi, Ashish. "Hadoop Distributed File System |
Apache Hadoop HDFS Architecture | Edureka." Edureka
Blog. December 05, 2018. Accessed December 06, 2018.
https://www.edureka.co/blog/apache-hadoop-hdfs-architectu
re/.