Infinispan, transactional key value data grid and nosql database
1.
2. • Sr. Consultant at Inmeta Consulting
• Current project: Skatteetaten Grid POC
• Previous projects involving grid technologies:
• Mattilsynet food authority system.
• FrameSolution BPM framework used in Lovisa National Court
Authority (Norway), Mattilsynet Food Authority
• Other noteworthy projects
• Coca Cola Basis ERP system – Coca Cola Bottler factories
• mPower Mobilitec 300 million subscribers worldwide, and delivers
over 500,000 pieces of content every day.
3. • Big data, Databases are slow. Memory is FAST!
• Provides huge computing power.
• Tax calculation
• Financial organizations
• Government organizations use it for communication and
data sharing between the different departments.
• Scientific computations
• MMORPG games
4. • General terminology relevant to Distributed Caching
• Challenges related to introducing distributed caching to
existing system
• Metrics and tuning
5. • Cache JSR – 107
• Java Data Grid JSR - 347
• In memory Data Grid
• Cluster
• Distribution
• Node – a member of a cluster
• Transaction awareness
• Colocation
• Map / Reduce
• Consistency
11. • Our custom cache is super fast, but its cache hit ratio is
rather low.
• Our custom cache has a tendency to get dirty, as the
updates to the shared data cannot be propagated. At the
same time the separation of the data regions is not complete.
• Marshaling is a rather slow and heavy process.
• We are facing a technological cocktail and we need to keep
integrity.
15. • Eviction
• Least Recently Used
• First In First Out
• LIRS
• Custom
• Expiration
• Invalidation
16. • Ref. Data vs Transactional
• Reference data: Good.
Max 30000 reads/sec 1k size
• Transactional data: Good.
Max 25000 writes/sec 1k size
17. • Reference data: Good.
30000 reads/sec per server.
Grow linearly by adding servers.
• Transactional data: Not so
good. Max 20000 writes/sec.
Drops to 2500 if you add a
3rd server.
18. • Ref. Data vs Transactional
• Reference data: Good.
Max 30000 reads/sec 1k size
• Transactional data: Good.
Max 25000 writes/sec 1k size
19. • Reference data (1 KB): Good.
30000 reads/sec per server.
Grow linearly by adding servers.
• Transactional data (1 KB): Good.
20000 writes/sec per server.
Grow linearly by adding servers.
20. • What is the size of our cluster? Reads vs. Writes
• Communication inside our grid
• UDP, TCP
• Synchronous vs. Asynchronous.
• What about the transaction isolation?
• Repeatable Reads vs. Read Committed
• What is the nature of our application?
• Read intensive data
• CMS systems
• Write Intensive Data
• Document Management System
21. • Level 1 cache is supported only for distribution mode
• Level 1 cache might have a performance impact in certain systems
23. • Long running transactions need to be avoided.
• What is a long running transaction? How long is actually long?
• Read Committed vs Repeatable Reads
24. [Diagram: two transactions deadlocking on each other's locks]
TX1 (wants to update A, B, C): A is locked by TX1
TX2 (wants to update C, B, A): C is locked by TX2
25. [Diagram: timeline of two concurrent transactions]
TX1: begin, get(k), -, -, get(k). What is returned?
TX2: begin, get(k), put(k, v2), commit
10 years of experience as architect or developer. Participation in various large scale enterprise products for world leading companies like Coca Cola Computer Service and Mobilitec. For the last 5 years, focus on distributed systems in the context of Business Process Management.
A data grid is an architecture or set of services that gives individuals or groups of users the ability to access, modify and transfer extremely large amounts of geographically distributed data.
Data grids, or IMDGs (in-memory data grids), are defined by Gartner as follows: IMDGs implement the notion of a "distributed, in-memory virtual data store" (typically called the data grid, but at times called the "cache" or "space" for historical reasons) by clustering the central memory (RAM) of multiple computers over a network. This allows applications to deal with very large (up to multiple petabytes in size, in some user experiences) in-memory data stores, and leverage fast and scalable access to data. IMDGs provide the mechanisms and APIs that present to applications the memory of the clustered computers as a uniform, integrated data store. Applications don't need to know in which computer's RAM a given data object is stored to retrieve it. The IMDG runtime retrieves the required object across the data grid in a location-transparent way, while managing such issues as security, data integrity, availability and recovery, in case of system crashes.
In this context there are many definitions of what a grid and a cache are, some of them more business oriented, some more technical. In an attempt to answer the question "What is a grid?" I opened the not yet approved JSR 347 specification and its glossary in order to define a minimal, fundamental set of qualities that a grid has to offer. Here is the full list.
• Cache: A temporary in-memory store of data exhibiting high performance, threadsafe access. JSR 107 (Temporary caching for the Java Platform) covers this concept in more detail.
• Distributed Cache: Often, data grids are used as distributed, cluster-aware in-memory caches, usually placed in front of a more expensive, slower data store such as a relational database. Standalone caches don't work in this regard if the application tier is clustered, as caches could serve stale data. JSR 107 covers distributed caches as well to some degree.
• Cluster: A set of servers connected via a network, usually a LAN.
• Distribution: The concept of a cluster spreading data across its various constituent nodes in a manner transparent to any client attempting to locate or use such data.
• Node: A member in a cluster. A node may be a separate physical machine, a separate virtual machine on the same physical host, or a separate JVM on the same physical or virtual machine. Typically, each node would have its own network address, such as an IP address and port, on which other nodes could connect to it.
• Transaction: An atomic unit of work. Transactions may be JTA and XA compliant.
• Colocation: The concept of ensuring data entries that are used together in the same transaction are stored on the same nodes in a cluster.
• Map/Reduce: Based on Google's seminal paper from 2004, Map/Reduce allows computations on the entire data set to be broken down into tasks that run on each node and then aggregate results. It is a divide and conquer technique for dealing with large data sets.
• Eventual Consistency: Based on Eric Brewer's CAP theorem, which outlines desirable characteristics of distributed systems, Eventual Consistency is the result of attempting to provide high availability even during network partitions. See Coda Hale's excellent blog on the subject.
We define this model of a real world application not as a blueprint. Actually it is an application with many good sides and many flaws, just like any average application.
We have a group of several application servers forming a cluster in the internal network, another set of Tomcat or web servers in the demilitarized zone, a document management system and an ESB server.
What we can say about this system is that it represents a classical medium to large scale application (more on the medium side). The data is marshaled once at the first firewall and a second time at the second firewall. Although the app servers form a cluster, there is no distributed caching within this system.
Our backend demonstrates a very simple approach for preserving consistency of the data within the cluster. In order to avoid collisions on the heavily updated data, the architects of the system have decided to separate the data into regions and have each server point to its own region. Such an approach might fix some of the issues in the short term, but it does not offer a long term solution. Heavily updated data will produce more and more optimistic locking exceptions (if there is optimistic locking at all). Clear separation between the data is mission impossible: there is always that small amount of shared data which will cause trouble.
So our system expands with time and those flaws become more and more visible. At one point a distributed caching solution is offered to the client in an attempt to fix these data integrity issues.
The integration framework is a sort of anticorruption layer that keeps integrity and unifies the approach to entities with different origins. For example, some POJOs are just cached from the Ephorte document management system, while other entities in the cache originate from the DB layer. We have a system with many different sources of data that is kept in the cache.
On this slide we can observe some of the technological challenges that the system presents. We have a technological cocktail presented to us because of the 10 year history of development of this application. Many teams have worked on this system during these 10 years. The agile approach, which is more feature oriented, clearly takes its toll on this application, as we have different modules written in different time periods using different technologies. Here the anticorruption layer is making everything work transparently together.
We can identify several problematic areas in terms of our future migration to distributed caching:
• Transaction scope (when we should use a transaction, when we should split it into several, when we should not use transactions at all).
• Locking and potential deadlocking situations.
• Once we remove the legacy cache our anticorruption layer will stop functioning, and we will be exposed to the underlying technology.
• Performance. We should be ready for the distributed cache to appear slower than our legacy cache, because the old cache holds the hydrated value of the object and the access to this value is instant, while the distributed cache holds its binary form, which needs to be unmarshaled first. At the same time our old cache is more prone to overflowing and cache misses, so it is quite easy with a good test to demonstrate that the old cache is actually performing poorly under certain conditions.
The mixture of the technology stack might present additional challenges. Mixing JPA code and JDBC, for example, presents a challenge in terms of flushing policy. In order for them to coexist, the entity manager needs to be flushed every single time before an SQL query is executed. If you open the JPA specification and look at the flush mode, you may observe that this kind of behavior is the default when a JPA query is executed, so this flushing should not present any performance issues, as it is within the framework of the regular behavior.
This is just a single example of the low level challenges that need to be solved. There are many others.
Our legacy cache might have many different interested parties: DTOs coming from EJB 2.1 entity beans (EJB 2.1 entity beans are not serializable), JPA entities, custom collections, non-persistent POJOs from different web services and so on.
Our end goal is to remove the old legacy cache. At the same time we want the rest of our application to behave in exactly the same way as it was behaving before. Unfortunately, once we have removed the cache we have exposed the integration framework to the underlying technology, so the majority of our code dealing with co-existence of the different technologies will reside exactly there. Our goal is to mimic the behaviour of the old system and, when we are not able to do that, to minimize the required changes as much as possible.
Replication can be synchronous or asynchronous (write-through or write-behind). Synchronous replication blocks the caller (e.g. on a put()) until the modifications have been replicated successfully to all nodes in a cluster. Asynchronous replication performs replication in the background (the put() returns immediately).
Infinispan offers a replication queue, where modifications are replicated periodically (i.e. interval-based), or when the queue size exceeds a number of elements, or a combination thereof. A replication queue can therefore offer much higher performance, as the actual replication is performed by a background thread.
Asynchronous replication is faster (no caller blocking), because synchronous replication requires acknowledgments from all nodes in a cluster that they received and applied the modification successfully (round-trip time). However, when a synchronous replication returns successfully, the caller knows for sure that all modifications have been applied to all cache instances, whereas this is not the case with asynchronous replication. With asynchronous replication, errors are simply written to a log. Even when using transactions, a transaction may succeed but replication may not succeed on all cache instances.
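The difference between the two modes can be sketched in plain Java. This is only a toy model under stated assumptions: each "node" is a map inside the same JVM, and the names ReplicationSketch, putSync, putAsync and readFrom are hypothetical, not part of the Infinispan API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical toy model: every "node" of the cluster is a map in this JVM.
class ReplicationSketch {
    private final List<Map<String, String>> replicas = new ArrayList<>();
    // Stand-in for the background replication thread.
    private final ExecutorService background = Executors.newSingleThreadExecutor();

    ReplicationSketch(int nodes) {
        for (int i = 0; i < nodes; i++) {
            replicas.add(new ConcurrentHashMap<>());
        }
    }

    // Synchronous: the caller is blocked until every node has applied the write.
    void putSync(String key, String value) {
        for (Map<String, String> replica : replicas) {
            replica.put(key, value);
        }
    }

    // Asynchronous: the local node (index 0) is updated at once, the other
    // nodes are updated later by the background thread.
    CompletableFuture<Void> putAsync(String key, String value) {
        replicas.get(0).put(key, value);
        return CompletableFuture.runAsync(() -> {
            for (int i = 1; i < replicas.size(); i++) {
                replicas.get(i).put(key, value);
            }
        }, background);
    }

    String readFrom(int node, String key) {
        return replicas.get(node).get(key);
    }

    void shutdown() {
        background.shutdown();
    }
}
```

After putSync returns, a read from any node sees the value. After putAsync returns, only node 0 is guaranteed to have it until the returned future completes, which is the window in which a remote read can be stale.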
Invalidation is a clustered mode that does not actually share any data at all, but simply aims to remove data that may be stale from remote caches. This cache mode only makes sense if you have another, permanent store for your data, such as a database, and are only using Infinispan as an optimization in a read-heavy system, to prevent hitting the database every time you need some state. If a cache is configured for invalidation rather than replication, every time data is changed in a cache, other caches in the cluster receive a message informing them that their data is now stale and should be evicted from memory.
Invalidation too can be synchronous or asynchronous. Just as in the case of replication, synchronous invalidation blocks until all caches in the cluster have received invalidation messages and have evicted stale data, while asynchronous invalidation works in a 'fire-and-forget' mode, where invalidation messages are broadcast but the caller doesn't block and wait for responses.
Distribution is a powerful clustering mode which allows Infinispan to scale linearly as more servers are added to the cluster. The hashing algorithm is configured with the number of copies of each cache entry that should be maintained cluster-wide. More copies mean lower performance. Regardless of how many copies are maintained, distribution still scales linearly.
Eviction refers to the process by which old, relatively unused, or excessively big data can be dropped from the cache, allowing the cache to remain within a memory budget. When a segment gets full, the eviction thread is able to dispose of its content. This is why eviction usually happens before the maximum number of entries specified for a cache region is reached.
Expiration: we can define the lifespan of an entity. Once this lifespan is reached, the entity expires.
Invalidation occurs when an entity is deleted from a cache region. This might occur, for example, if an entity is updated and its consistency needs to be preserved across the different nodes.
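As an illustration of the Least Recently Used policy from the eviction slide, here is a minimal sketch built on the JDK's LinkedHashMap in access order. The class name LruCache and the maxEntries parameter are assumptions made for this example; Infinispan's own eviction is segment based and works differently internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU sketch on top of the JDK; not how Infinispan implements eviction.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently used.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true drops the
    // least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

With a capacity of 2, putting "a" and "b", reading "a" and then putting "c" evicts "b", because "a" was touched more recently.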
Reference data use means you cache something once and read it over and over again. So, there are a lot more reads than writes. On the other hand, transactional data use means that you're updating the data as frequently as you're reading it (or fairly close to it).
A Mirrored Cache is a 2-server active/passive cache cluster. All the clients only connect to the active cache server and do their read and write operations against it. For all updates done to the cache (add, insert, and remove) the same updates are also made to the passive server, but in the background and as bulk operations. This means that the clients don't have to wait for the updates to be done to the passive server. As soon as the active server is updated, control returns to the client, and then the passive server is updated by a background thread.
This gives Mirrored Cache a significant performance boost over a Replicated Cache of the same cluster size. A Mirrored Cache is almost as fast as a stand-alone Local Cache, which has no clustering cost. But, at the same time, a Mirrored Cache provides reliability through replication in case the active cache server goes down.
If the active server ever goes down, the passive server automatically becomes active and all clients automatically connect to this new active server. All of this happens without any interruptions to your application. When we bring the previously active server back up, it joins the cluster and becomes passive, since there is now another server that is already active.
Mirrored Cache suits situations where you only have one dedicated cache server and the mirror server is shared with other apps. If you need 3 or more cache servers, then Partitioned-Replica Cache is the best choice for transactional use.
A Replicated Cache consists of two or more cache servers in a cluster. Each cache server contains the entire cache, and any updates to the cache on any server are applied synchronously to all the other servers in the cluster. Replicated Cache ensures that all updates to the cache are made as atomic operations, meaning either all cache servers are updated or none are updated. The benefit of Replicated Cache is the extremely fast GET performance. Whichever server a client is connected to always has the entire cache. As a result, all GET operations find the data locally on that cache server, and this boosts the GET speed. However, the cost of an update operation does not scale well as you add servers to a Replicated Cache.
A Partitioned Cache is intended for larger cache clusters, as it is a very scalable caching topology. The cost of a GET or UPDATE operation remains fairly constant regardless of how big your cache cluster is. There are two reasons for it.
First of all, the cache partitioning is based on a hash map algorithm (similar to a Hashtable). A distribution map is created and sent to all the clients, telling them which partition has the data or should have the data. This allows the clients to go directly to the cache server that has the data they are looking for.
Secondly, all updates are made to only one server, and therefore no sequencing logic is required. Obtaining a sequence adds an extra network round-trip in most cases.
So, not only are GET operations as fast as in a Replicated Cache, the UPDATE operations are much faster and remain fast regardless of how large the cache cluster gets. This constant cost makes Partitioned Cache a highly scalable topology.
However, please note that there is no replication in Partitioned Cache. So, if any cache server goes down, you lose that part of the cache. This may be okay in many object caching situations, but is not okay when you're using the cache as your main data repository without the data existing in any master data source. A good example of this is ASP.NET Session State storage or JSP Servlet session storage in the cache.
Partitioned-Replica Cache is a combination of Partitioned Cache and Replicated Cache. It gives you the best of both worlds: you get reliability through replication and scalability through partitioning. Instead of replicating the cache over and over again if you have more than 2 servers in the cluster, you only replicate the cache once (meaning only two copies of the cache exist) regardless of how big the cache cluster is. This allows you to scale out through partitioning.
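How a key is mapped to a partition and its replica can be sketched as follows. This is a simplified illustration that assumes a plain modulo on the key's hash code; real data grids typically use consistent hashing so that few keys move when a node joins or leaves. The names PartitionSketch, ownersOf and numOwners are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified key-to-node mapping; plain modulo hashing is an assumption here.
class PartitionSketch {
    private final List<String> nodes;
    private final int numOwners; // 1 = partitioned only, 2 = partition plus one replica

    PartitionSketch(List<String> nodes, int numOwners) {
        if (numOwners < 1 || numOwners > nodes.size()) {
            throw new IllegalArgumentException("numOwners must be between 1 and the cluster size");
        }
        this.nodes = new ArrayList<>(nodes);
        this.numOwners = numOwners;
    }

    // Every client can compute the owners locally, so it can go straight to
    // the right server without asking the rest of the cluster.
    List<String> ownersOf(Object key) {
        int primary = Math.floorMod(key.hashCode(), nodes.size());
        List<String> owners = new ArrayList<>();
        for (int i = 0; i < numOwners; i++) {
            owners.add(nodes.get((primary + i) % nodes.size()));
        }
        return owners;
    }
}
```

With numOwners set to 2, the number of copies stays at two no matter how many nodes are added, which is the property the partitioned-replica topology relies on.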
Now this is the moment when we should take a calculator and start taking the metrics of our system. Such metrics are:
• Size of the cluster
• Average size of the marshaled data
• Size of the replication queue (if used)
• Do we have different locations?
• Average lock duration
• Do we use a persistent store (Hibernate)?
• and more…
Is our system read intensive or write intensive? If it is write intensive, we should probably think of a partitioned strategy. We can mix more than one topology within our system: user session data and reference data can use replication, while update heavy data uses partitioning.
For small clusters we will use TCP, for large ones UDP. Why? UDP creates a smaller amount of network traffic. Probably one of the most important questions is how many servers we are going to use in the grid.
Level one (L1) cache is a region that may reside in every node. When a node is asked for a value it does not hold, the call is propagated to another node. When the result is returned, if L1 cache is enabled, the result is placed within this region for a user defined time, so that repetitive calls hit the L1 cache instead of doing remote calls every time. When an entry in the cache is updated, it needs to be invalidated across the whole cluster and in all L1 caches, so if no repetitive calls are occurring within the system, we might pay a performance penalty for enabling this cache.
Invalidation, when used with a shared cache loader, would cause remote caches to refer to the shared cache loader to retrieve modified data. The benefit of this is twofold: network traffic is minimized, as invalidation messages are very small compared to replicating updated data, and other caches in the cluster look up modified data in a lazy manner, only when needed.
Within Infinispan we have different CacheLoaders and CacheStores; every CacheLoader, when persistent, can also be called a CacheStore. Through a CacheLoader, the loading process of a particular value, when it does not exist in memory, can be delegated to a third party. The third party may be a persistent store like an RDBMS or a NoSQL database, it may be another cluster, or something completely different. If passivation is enabled for a cache store, this means that an entity can exist either within the store the loader is pointing to or in memory, but not both. So we have an XOR condition between them.
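The XOR condition that passivation creates can be shown with a small sketch. It assumes two in-process maps, one standing in for memory and one for the persistent store; PassivatingCache and its methods are hypothetical names and do not mirror the Infinispan CacheStore interface.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of passivation: an entry lives in memory or in the store, never both.
class PassivatingCache {
    private final Map<String, String> memory = new HashMap<>();
    private final Map<String, String> store = new HashMap<>(); // stand-in for a persistent store

    void put(String key, String value) {
        store.remove(key);      // a fresh write supersedes any passivated copy
        memory.put(key, value);
    }

    // Passivation: the entry is moved out of memory into the store.
    void passivate(String key) {
        String value = memory.remove(key);
        if (value != null) {
            store.put(key, value);
        }
    }

    // Activation: on a memory miss the entry is moved back from the store.
    String get(String key) {
        String value = memory.get(key);
        if (value == null) {
            value = store.remove(key);
            if (value != null) {
                memory.put(key, value);
            }
        }
        return value;
    }

    boolean inMemory(String key) { return memory.containsKey(key); }

    boolean inStore(String key) { return store.containsKey(key); }
}
```

At every point exactly one of inMemory and inStore is true for a key that exists, which is the XOR relation described above.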
At the moment (Infinispan 5.0) two locking schemes are supported:
• Pessimistic, in which locks are acquired remotely on each transactional write. This is named eager locking.
• A hybrid optimistic-pessimistic locking approach, in which local locks are acquired as the transaction progresses and remote locks are acquired at prepare time. This is the default locking scheme.
This document describes a replacement for the hybrid locking scheme with an optimistic scheme. The rule of thumb is that all read operations are lock free and all write operations acquire a lock. This lock can be a remote (cluster wide) lock or a local lock, held for the duration of the transaction.
Repeatable read is a higher isolation level that, in addition to the guarantees of the read committed level, also guarantees that any data read cannot change: if the transaction reads the same data again, it will find the previously read data in place, unchanged, and available to read.
A classical deadlock example in Infinispan prior to version 5.1, when optimistic locking was implemented, is the example from the slide. We have 2 transactions and READ_COMMITTED transaction isolation. There is one set of 3 values and two transactions updating the values in different order: TX1 updates A, B, C and TX2 updates C, B, A. Within the transactions, locks are obtained on A and C, and then each thread is holding a lock on a value the other thread wants to use. No thread is able to advance, and so a deadlock occurs. Again, in Infinispan 5.1 there is optimistic locking. But because we are talking about a legacy system, most probably we will not be able to use the latest version.
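The slide scenario can be reproduced with plain JDK locks. The sketch below is not what the slides prescribe and it does not use Infinispan; it shows a general technique, acquiring locks in one global order, under which two transactions that touch A, B, C and C, B, A can no longer block each other. The names OrderedLocking, updateAll and demo are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// General lock-ordering technique, shown with JDK locks only.
class OrderedLocking {
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();
    private final Map<String, Integer> data = new ConcurrentHashMap<>();

    // Locks are always taken in sorted key order, whatever order the caller asks for.
    void updateAll(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        for (String key : sorted) {
            locks.computeIfAbsent(key, k -> new ReentrantLock()).lock();
        }
        try {
            for (String key : keys) {
                data.merge(key, 1, Integer::sum);
            }
        } finally {
            // Release in reverse acquisition order.
            for (int i = sorted.size() - 1; i >= 0; i--) {
                locks.get(sorted.get(i)).unlock();
            }
        }
    }

    int value(String key) {
        return data.getOrDefault(key, 0);
    }

    // Runs TX1 (A, B, C) and TX2 (C, B, A) concurrently and returns the final value of A.
    static int demo(int rounds) {
        OrderedLocking grid = new OrderedLocking();
        Thread tx1 = new Thread(() -> {
            for (int i = 0; i < rounds; i++) grid.updateAll(Arrays.asList("A", "B", "C"));
        });
        Thread tx2 = new Thread(() -> {
            for (int i = 0; i < rounds; i++) grid.updateAll(Arrays.asList("C", "B", "A"));
        });
        tx1.start();
        tx2.start();
        try {
            tx1.join();
            tx2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return grid.value("A");
    }
}
```

If updateAll locked the keys in the order given by the caller instead, TX1 could hold A while waiting for C and TX2 could hold C while waiting for A, which is exactly the situation on the slide.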
One way to escape from the deadlock condition on the previous slide would be to simply elevate the transaction isolation level from READ_COMMITTED to REPEATABLE_READ. This elevation might cause a slight performance decrease, but in the general case it should be so small that for most systems it will be negligible. When we elevate the isolation level, each thread will have its own view of the modified values, which will be fixed for the given transaction span. So no matter whether another transaction updates that value, the original transaction's view of the value will always be the same.
On this slide the last get(k) call in the first transaction will return the same as the first get(k), although another transaction has updated the value between the first and the last call. If the isolation level were READ_COMMITTED, the last get(k) would have returned the updated value.
Legacy applications with an RDBMS store most probably use some kind of entity framework, Hibernate for example. They already behave similarly to REPEATABLE_READ isolation because of their Level 1 cache. The Hibernate Level 1 cache acts per transaction, so once an entry is inserted there, every repeated call will hit that value, no matter whether another thread has already updated the value. So REPEATABLE_READ is compatible with a Hibernate application; I would say that it is even recommended.
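The question from the slide, "What is returned?", can be answered with a small simulation. This is a sketch under simplifying assumptions: a single in-process map plays the role of the committed state, and a REPEATABLE_READ transaction simply remembers the first value it read for each key. IsolationSketch, Tx and commitPut are hypothetical names, not Infinispan classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy simulation of the two isolation levels for reads.
class IsolationSketch {
    enum Isolation { READ_COMMITTED, REPEATABLE_READ }

    // Committed state visible to every transaction.
    private final Map<String, String> committed = new ConcurrentHashMap<>();

    // Stands for "put(k, v) followed by commit" in some other transaction.
    void commitPut(String key, String value) {
        committed.put(key, value);
    }

    Tx begin(Isolation level) {
        return new Tx(level);
    }

    class Tx {
        private final Isolation level;
        // Per-transaction view: the first value read for each key.
        private final Map<String, String> firstReads = new HashMap<>();

        Tx(Isolation level) {
            this.level = level;
        }

        String get(String key) {
            if (level == Isolation.READ_COMMITTED) {
                return committed.get(key); // always the latest committed value
            }
            if (!firstReads.containsKey(key)) {
                firstReads.put(key, committed.get(key));
            }
            return firstReads.get(key); // the same value for the whole transaction
        }
    }
}
```

If TX1 reads k as v1, TX2 then commits put(k, v2), and TX1 reads k again, the second read returns v1 under REPEATABLE_READ and v2 under READ_COMMITTED.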
Here we have jconsole, which ships with JDK 5, 6 and 7, a very useful tool for monitoring the registered JMX beans as well as the threads and the garbage collector. We can use it to monitor the Infinispan statistics. Important metrics:
• Average read time
• Number of evictions
• Cache hits
• Cache misses
• Read to write ratio: based on this we can decide whether a particular region needs a replication or a partitioning strategy
Average replication time. There is a timeout set in the configuration; if it is exceeded, the transaction will fail. If we have such a case, we can either increase the timeout or think of custom serialization in an attempt to make the entity more lightweight. Or we can just think about how to minimize the amount of data sent over the network.
Concurrency level: the number of concurrent updates that might happen. The ConcurrentHashMap is divided into segments based on the concurrency level. The best thing to do here is just to read the javadoc of ConcurrentHashMap.
Standard Java serialization performance is very low. On the chart we can observe different serialization frameworks and their performance. It depends on the test scenario, but the size of the marshaled data can be reduced 4 times if Externalization is used instead of standard Java serialization.
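The size difference is easy to measure with the JDK alone. The sketch below serializes the same three fields once through default Java serialization and once through Externalizable and compares the byte counts. The classes PersonS, PersonE and MarshalledSize are made up for this example, and the exact ratio depends on the object, so no particular factor is claimed here.

```java
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Default serialization: the stream also carries field names and types.
class PersonS implements Serializable {
    private static final long serialVersionUID = 1L;
    long id;
    String name;
    int age;

    PersonS(long id, String name, int age) {
        this.id = id;
        this.name = name;
        this.age = age;
    }
}

// Externalizable: we decide exactly which bytes are written.
class PersonE implements Externalizable {
    long id;
    String name;
    int age;

    public PersonE() { } // public no-arg constructor, needed for deserialization

    PersonE(long id, String name, int age) {
        this.id = id;
        this.name = name;
        this.age = age;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeLong(id);
        out.writeUTF(name);
        out.writeInt(age);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        id = in.readLong();
        name = in.readUTF();
        age = in.readInt();
    }
}

class MarshalledSize {
    // Returns the number of bytes ObjectOutputStream produces for one object.
    static int sizeOf(Object value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }
}
```

Comparing MarshalledSize.sizeOf(new PersonE(1L, "Ada", 36)) with MarshalledSize.sizeOf(new PersonS(1L, "Ada", 36)) shows the externalized form to be the smaller one, because the field descriptors are not written to the stream.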
Keep in mind that our integration framework still exists, and the majority of our code dealing with co-existence of the different technologies will reside exactly there.