Oak, the Architecture of the new Repository
Michael Dürig, Adobe Research Switzerland
Design goals 
• Scalable 
• Big repositories 
• Clustering 
• Customisable, flexible 
• OSGi friendly 
Outline 
• CRUD 
• Changes 
• Search 
• Big Picture 
Tree model
[diagram: a tree with root /, children a and d; a has children b and c]
Updating
[diagram: a new node x is added at /d/x while other users are looking at the tree]
MVCC
[diagram: two revisions r1 and r2 of the tree sharing unchanged nodes; HEAD points to r2, which contains the new node /d/x and fresh copies of / and /d]
Refresh and Garbage Collection
Refresh
[diagram: a session on an old revision moves forward to HEAD, leaving superseded nodes behind as garbage]
Garbage collection
[diagram: the garbage left behind by superseded revisions is collected]
Concurrency and Conflicts
Concurrent updates
[diagram: revisions r2a and r2b, both derived from the base revision r1]
Merging
[diagram: the concurrent updates r2a and r2b of r1 are merged into a new revision r3]
Conflict handling: serialisation 
• Fully serialised 
– Fail, no concurrent update 
• Partially serialised 
– Concurrent conflict free updates 
Conflict handling strategies: merging 
• Partial merge 
– Conflict markers, deferred resolution 
• Full merge 
– Need to choose victim 
Replicas and Shards
Replica and caches
[diagram: a master copy, a full replica, and a cache following only parts of the tree]
Sharding strategies
[diagram: sharding by path, by level, and by hash with caching]
Implementations
MicroKernel / NodeStore
• Tree / revision model implementation
Responsible for: clustering, sharding, caching, conflict handling
Not responsible for: validation, access control, search, versioning
Current implementations

                    DocumentMK                  TarMK (SegmentMK)
Persistence         MongoDB, JDBC               Local FS
Conflict handling   Partial serialisation       Full serialisation
Clustering          MongoDB clustering          Simple failover
Sharding            MongoDB sharding            N/A
Node performance    Moderate                    High
Key use cases       Large deployments (>1TB),   Small/medium deployments,
                    concurrent writes           mostly read
Access Control
Accessible paths
[diagram: the example tree in which only /, /d and /a/b are accessible]
xistentialism 
• All paths traversable 
– Node may not exist 
– Decorator on NodeStore 
root.getChildNode("a").exists(); 
root.getChildNode("a") 
.getChildNode("b").exists(); 
⟹ false 
⟹ true 
11/17/2014 23
Comparing Revisions
Content diff 
• What changed between trees 
• Cornerstone for 
– Validation 
– Indexing 
– Observation 
– … 
What changed?
[diagram: a Δ computed between two revisions of a tree]
Example: merging
[diagram: r1 branches into r2a and r2b, which are merged into r3]
Δ r1 ➞ r2a: “a” modified, “b” removed
Δ r1 ➞ r2b: “d” modified, “x” added
Commit Hooks
Commit hooks 
• Key plugin mechanism 
– Higher level functionality 
• Validation (node type, access control, …) 
• Trigger (auto create, defaults, …) 
• Updates (index, …) 
Editing a commit
[diagram: a hook transforming a commit’s change set from Δ to Δ + x]
Commit hooks 
• Based on content diff 
– pass a commit 
– fail a commit 
– edit a commit 
• Applied in sequence 
Type of hooks

                    CommitHook   Editor      Validator
Content diff        Optional     Always      Always
Can modify          Yes          Yes         No
Programming model   Simple       Callbacks   Callbacks
Performance impact  High         Medium      Low
Observers
Observers 
• Observe changes 
– After commit 
– Often does a content diff 
– Asynchronous 
– Optionally synchronous 
• Local cluster node only 
Examples 
• JCR observation 
• External index update 
• Cache invalidation 
• Logging 
Search
Query Engine
[diagram: queries such as SELECT … WHERE x=y or /a//* are handled by pluggable parsers; a query flows through parse, execute and post-process stages, with execution choosing among the registered indexes or falling back to the traversal index]
Index Implementations 
• Property (ordered) 
• Reference 
• Lucene 
– In-content or file system 
• Solr 
– Embedded or external 
Big Picture
Big picture
[layer diagram, top to bottom: JCR API, Oak JCR, Oak API, Oak Core (with Plugins attached), NodeStore API, MicroKernel]
Resources 
http://jackrabbit.apache.org/oak/ 
Appendix
Resources
http://jackrabbit.apache.org/oak/
http://jackrabbit.apache.org/oak/docs/
https://svn.apache.org/repos/asf/jackrabbit/oak/trunk/
Session Notes
Slide 1
This presentation is mainly about Oak’s architecture and design. Understanding these concepts gives crucial insight into how to make the most of Oak and why Oak might behave differently than Jackrabbit 2 in some cases.
Slide 2
Jackrabbit Oak started in early 2012, with some initial ideas dating back as far as 2008. It became necessary as many parts of Jackrabbit 2 outgrew their original design. Most of Jackrabbit 2’s features date back to the 1990s and are not well suited to today's requirements. Oak was designed to overcome those challenges and to serve as the foundation of modern web content management systems.
Key design goals:
• Scalable writes: the web is not read-only any more.
• Large amounts of data: there is much more than a few web pages nowadays.
• Built-in clustering, instead of clustering built on top.
• Customisable
• OSGi friendly
Since Oak doesn't need to be the JCR reference implementation, we gained some additional design space by not having to implement all of the optional features (e.g. same-name siblings and support for multiple workspaces).
Slide 3
• CRUD: this presentation first covers the underlying persistence model: the tree model and the basic create, read, update and delete operations.
• Changes: being able to track changes between different revisions of a tree turns out to be crucial for building higher level functionality.
• Search: while not much changed on the outside, search is completely different in Oak compared to Jackrabbit 2.
Slide 4
Let’s consider a simple hierarchy of nodes. Each node (except the root) has a single parent and any number of child nodes. The parent-child relationships are named, i.e. each child has a unique name within its parent. This makes it possible to uniquely identify any node by its path: a user can access all content by path, starting from the root node.
This is a key difference from Jackrabbit 2, where each node was assigned a unique id used to look it up in the persistence store. In Oak, nodes are always addressed by their path from the root (sketched in code below). In this sense Oak stores (sub)trees while Jackrabbit 2 stores key-value pairs. In Oak one traverses down from the root following a path, while in Jackrabbit 2 traversal went from a node up to its parent and eventually the root.
• Tree persistence vs. key/value persistence
• Path vs. UID as primary identifier
• Traversing down vs. traversing up
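As a sketch of what path-based addressing looks like against Oak's NodeState API (getChildNode never returns null; absence is reported via exists()), here is an illustrative helper; the resolve function itself is not part of Oak:

    import org.apache.jackrabbit.oak.spi.state.NodeState;

    // Resolve a path like "/a/b" by traversing down from the root.
    static NodeState resolve(NodeState root, String path) {
        NodeState node = root;
        for (String name : path.split("/")) {
            if (!name.isEmpty()) {
                node = node.getChildNode(name); // never null, but may not exist
            }
        }
        return node;
    }

    // Usage: resolve(root, "/a/b").exists() tells whether /a/b is present.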
Slide 5
Let’s consider what happens when another user updates parts of the tree, for example by adding a new node at /d/x. Such in-place changes might confuse other users, whose tree suddenly changes.
This is how Jackrabbit 2 works: each update is immediately made visible to all users. Unfortunately, beyond the potential for confusion, this design turns out to be a major concurrency bottleneck, as the synchronisation overhead of keeping everyone aware of all changes as they happen becomes very high. The existing Jackrabbit architecture was heavily optimised for mostly-read use cases, with only occasional and rarely concurrent content updates. That optimisation no longer works well with increasingly interactive web sites and other content applications where all users are potential content editors.
More generally, the way such state transitions are handled has a major impact on how efficiently a system can scale up to handle lots of concurrent updates. Many NoSQL systems use the concept of eventual consistency, which leaves the rate (and often the order) at which new updates become visible to users undefined. This solves the concurrency issue, but can lead to even more confusion, as it might not be possible to clearly define the exact state of the repository.
The hierarchical structure of Oak allows us to solve both of these issues by borrowing an idea from version control systems like Git or Subversion.
Slide 6
Instead of overwriting the existing content, a new revision of the content is created, with new copies of the parent nodes all the way up to the root if needed. This allows all users to keep accessing their revision of the content regardless of what changes are being made elsewhere. All users are under the impression that they operate on a private copy of the whole repository (MVCC).
To make this work, each revision is assigned a unique revision identifier, and all paths and content accessed are evaluated in the context of a specific revision. For example, the original version of the /d node would be found by following that path within revision r1, and the new version of the node by following the path in revision r2. The unchanged node /a/c would be reachable, and identical, through both revisions r1 and r2.
Additionally, the repository keeps track of the HEAD revision, which records the latest state of the repository. A new user that has not yet accessed the repository would start with the HEAD revision.
The ordered list of the current HEAD revision together with all former HEAD revisions forms a journal, which represents the linear sequence of changes the repository went through until eventually reaching the current HEAD revision.
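A rough sketch of how an update produces a new revision through the NodeStore API (the merge signature below follows Oak's NodeStore interface; EmptyHook stands in for the commit hooks a real repository would run):

    import org.apache.jackrabbit.oak.api.CommitFailedException;
    import org.apache.jackrabbit.oak.spi.commit.CommitInfo;
    import org.apache.jackrabbit.oak.spi.commit.EmptyHook;
    import org.apache.jackrabbit.oak.spi.state.NodeBuilder;
    import org.apache.jackrabbit.oak.spi.state.NodeState;
    import org.apache.jackrabbit.oak.spi.state.NodeStore;

    static void addDX(NodeStore store) throws CommitFailedException {
        NodeState base = store.getRoot();      // snapshot of the current HEAD, e.g. r1
        NodeBuilder builder = base.builder();  // private, mutable copy of that revision
        builder.child("d").child("x");         // add /d/x in the copy
        store.merge(builder, EmptyHook.INSTANCE, CommitInfo.EMPTY); // publish a new HEAD, e.g. r2
        // "base" still refers to r1: readers of the old revision are unaffected.
    }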
Slide 8
Let’s consider what happens when a user on an old revision wants to move forward to the HEAD revision. Unlike in classic Jackrabbit, where this would always happen automatically, in Oak this state transition is handled explicitly as a refresh operation.
Refresh in Oak is explicit where it was implicit in Jackrabbit 2. For backward compatibility Oak provides some "auto refresh" logic; see the Oak documentation for further details.
Slide 9
After the refresh, the older versions of specific nodes may no longer be accessible by any clients and thus become garbage, which the system will eventually collect. This is a key difference from a version control system, and it allows Oak repositories to sustain workloads of thousands of updates per second without running out of space.
Slide 11
What about concurrent updates? Here we have two competing revisions, r2a and r2b. Both modify the same base revision r1. The r2a revision removes the node at /a/b, and r2b is the revision we saw earlier, which adds a new node at /d/x.
Slide 12
We start with the base revision r1. Then the two new revisions are created concurrently. Finally, the system automatically merges the results, creating the new revision r3, which contains the merged changes from r2a and r2b. The merge could happen on different cluster nodes, in different data centres or even between disconnected local copies (much like a git clone).
Slide 13
What happens when merging is not possible because two concurrent revisions conflict? There are several strategies for handling this case:
• Full serialisation: don't allow concurrent updates at all and fail. This heavily impacts the write rate in the face of many concurrent writers.
• Partial serialisation: instead of locking the whole tree, lock on the root of the modified subtree. This allows concurrent updates to separate subtrees.
Both strategies are currently implemented by Oak, depending on the persistence back-end in use.
Slide 14
Instead of pure serialisation, a more sophisticated approach would be to semantically merge conflicting changes. Generally this needs domain knowledge and cannot be fully implemented in the persistence layer.
• Partial merging: persist conflict markers along with the conflicting changes for deferred resolution, e.g. by an administrator.
• Full merging: this inevitably leads to data loss, as one of the changes has to be preferred over the other, unless a total order can be established over all revisions, which is usually not the case.
Oak currently does not implement these strategies, though they could be plugged in.
Slide 16
Replicas are straightforward: each replica only needs to follow the primary. Thanks to the immutable, append-only model of persistence, there is no need for invalidation. Replicas can run their own individual garbage collection cycles.
As a variant, a replica can also serve as a cache by only following frequently accessed parts of the tree and ignoring updates to other parts.
Slide 17
Sharding by path is the most straightforward variant. However, as subtree sizes tend not to be evenly distributed, the shards end up varying greatly in size.
Sharding by level turns out to be even more problematic, as there is more content the deeper one descends into the tree, while the root level holds only a single node. Although there are more revisions of the root (as every change creates a new root node), garbage collection keeps that number within reasonable limits.
Sharding by content hash is what the DocumentMK does, and it has turned out to be the most effective. A drawback of this approach is the loss of locality: the nodes of a path tend to be spread across various shards, forcing navigational access to touch multiple shards. One idea for addressing this is to allow each shard to cache a node's parent nodes; since the lower levels contain most of the content, caching the parents is cheap while at the same time restoring locality. A generic sketch of hash-based shard selection follows.
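This is only an illustration of the general technique, not DocumentMK's actual scheme:

    // Pick a shard for a node by hashing its identifier (illustrative only).
    static int shardFor(String nodeId, int shardCount) {
        return Math.floorMod(nodeId.hashCode(), shardCount); // stable, evenly spread
    }

    // The nodes of one path, e.g. /a, /a/b and /a/b/c, usually land on
    // different shards; caching parent nodes per shard restores locality.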
Slide 19
Unfortunately there is a bit of naming confusion in Oak, as the terms MicroKernel and NodeStore may both refer to APIs and to architectural components and are used somewhat interchangeably. One way to think of it is that the MicroKernel is an implementation of the tree model, while the NodeStore is a Java API for accessing the nodes of that tree model.
A MicroKernel implementation is responsible for all the immutable update mechanics of the content tree, along with conflict handling, garbage collection, clustering and sharding. It doesn't have any higher level functionality like validation, complex data types, access control, search or versioning. All of the latter are implemented on top of the NodeStore API.
Slide 20
Oak comes with essentially two MicroKernel implementations: the DocumentMK (formerly MongoMK) and the TarMK (aka SegmentMK). The DocumentMK started out as MongoDB-backed but is now more flexible regarding the choice of back-end; a JDBC back-end is currently being worked on.
• The DocumentMK leverages the clustering and sharding capabilities of the underlying back-ends and implements partial serialisation. Due to the extra network layer involved, it has moderate single-node performance, but scales with additional cluster nodes.
• The TarMK uses local tar files for storage. It is currently not cluster-aware and only offers a simple failover mechanism, but delivers maximal single-node performance.
The two MicroKernel implementations cover different use cases: where the DocumentMK is primarily for large deployments involving concurrent writes, the TarMK is better suited to small to medium deployments that are mostly read. We expect the gap between the two implementations to narrow in the future, as there are ideas to make the TarMK more cluster-ready while the DocumentMK still has room for improved performance. Finally, other implementations (e.g. Hadoop based) are quite possible, with the JDBC back-end for the DocumentMK most likely being the first one.
Slide 22
Access control is probably the trickiest feature of JCR to implement. It is also a very prominent feature, and might well be considered "the killer feature" for quite a few content-centric applications, as implementing fine grained access control on top of applications is difficult and error prone.
Access control in JCR allows a node to be accessible while its parent isn't. This is contrary to the tree model, where all access is by path, traversing down from the root of the tree. We solved this in Oak by introducing the concept of "existence" for a node. Nodes that are not accessible are still traversable but do not exist. That is, the result of asking for an inaccessible child node is indistinguishable from a "really" non-existing child: both do not exist. With this, all syntactically correct paths eventually resolve to a node, which however might not exist.
In this example all paths are traversable, but as only /, /d, and /a/b are accessible, only those nodes exist. The nodes at /a and /a/c do not exist.
Slide 23
This approach leads to a clean architecture where API clients needn't care about null values or exceptions: just traverse the path starting from the root and check for existence at the end.
The existence concept is implemented as a decorator of the tree model on top of the MicroKernel, by a thin wrapper around the relevant parts of the NodeStore API.
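A minimal sketch of such a decorator, assuming a canRead predicate supplied by the access control layer (this class and the predicate are illustrative; Oak's actual secured node state implementation is more involved):

    import java.util.function.Predicate;
    import org.apache.jackrabbit.oak.spi.state.NodeState;

    // Hides unreadable nodes by reporting them as non-existing while
    // keeping every path traversable (only the relevant methods shown).
    class SecureNodeState {
        private final NodeState delegate;
        private final Predicate<NodeState> canRead;

        SecureNodeState(NodeState delegate, Predicate<NodeState> canRead) {
            this.delegate = delegate;
            this.canRead = canRead;
        }

        boolean exists() {
            return delegate.exists() && canRead.test(delegate);
        }

        SecureNodeState getChildNode(String name) {
            // Always returns a wrapper, even for hidden nodes: traversal never fails.
            return new SecureNodeState(delegate.getChildNode(name), canRead);
        }
    }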
Slide 25
The content diff is key to most of Oak's higher level functionality. It allows processing just the parts of the content tree that changed (during or after a commit) instead of going through the whole repository. It is used e.g. for:
• Validation (node types, referential integrity, item names)
• Observation: note that commit boundaries get lost when the content diff spans more than a single revision. This is an important limitation with respect to Jackrabbit 2, as commit related information such as the user or the timestamp might not always be available in Oak.
• Indexing
Slide 26
Finding out what changed between revisions is crucial for much of Oak's higher level functionality. Changes are extracted by comparing two trees. This is similar to a diff in a version control system, but applied to content trees.
A content diff runs from the roots of the two trees towards their leaves. A node is considered changed when it has added/removed/changed properties or child nodes. Each changed node recursively runs a content diff on each of its changed child nodes.
Comparing trees in this way is generic, as it works on any (sub)trees, not only from the root. It is however heavily optimised for the case where an earlier revision is compared against a later one, as this is the case most often encountered.
In the depicted case the subtree at /a doesn't need deep comparison, as it is shared between both revisions; the diff process can stop right there.
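A sketch of running such a diff through Oak's NodeState API (compareAgainstBaseState and DefaultNodeStateDiff are part of Oak; the reporting body is made up):

    import org.apache.jackrabbit.oak.spi.state.DefaultNodeStateDiff;
    import org.apache.jackrabbit.oak.spi.state.NodeState;

    // Report child nodes added between a base revision and a later one.
    static void reportAdded(NodeState before, NodeState after) {
        after.compareAgainstBaseState(before, new DefaultNodeStateDiff() {
            @Override
            public boolean childNodeAdded(String name, NodeState child) {
                System.out.println("added: " + name);
                return true; // continue diffing the remaining children
            }
        });
    }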
Slide 27
Merging concurrent changes relies on content diffs: first the diff between r1 and r2a is applied to r1, followed by the diff between r1 and r2b, which finally results in the new revision r3.
Slide 29
Commit hooks and their variants provide the key plugin mechanism of Oak. Much of Oak's functionality is in one way or another implemented this way:
• Validation of node types, access control, protected or invisible content
• Triggers for auto-created or default values
• Updates for indexes or caches
Slide 30
Commit hooks rely on content diffs and allow validating or editing a commit before it is actually persisted.
Slide 31
A commit hook can either
• pass a commit on to be persisted unchanged,
• fail a commit so it won't get persisted, or
• edit a commit so that a changed tree will be persisted.
Commit hooks are applied in sequence and tend to be expensive, as each of them will most likely run a separate content diff.
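A sketch of a hook that fails certain commits (the processCommit signature follows Oak's CommitHook interface; the rule itself is a made-up example):

    import org.apache.jackrabbit.oak.api.CommitFailedException;
    import org.apache.jackrabbit.oak.spi.commit.CommitHook;
    import org.apache.jackrabbit.oak.spi.commit.CommitInfo;
    import org.apache.jackrabbit.oak.spi.state.NodeState;

    class NoForbiddenNodesHook implements CommitHook {
        @Override
        public NodeState processCommit(NodeState before, NodeState after, CommitInfo info)
                throws CommitFailedException {
            if (after.hasChildNode("forbidden")) {
                // fail the commit: nothing gets persisted
                throw new CommitFailedException("Constraint", 0, "forbidden node");
            }
            return after; // pass it on unchanged; an editing hook would return a modified state
        }
    }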
Slide 32
Editors and Validators are commit hooks in disguise. They do a single content diff, calling back into the respective implementations. This makes them considerably less expensive than raw commit hooks. However, due to the callback-based programming model, they are more difficult to use.
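A sketch of a Validator built on Oak's DefaultValidator base class (the rejected property name is an example; note that DefaultValidator does not descend into child nodes unless its child methods are overridden):

    import org.apache.jackrabbit.oak.api.CommitFailedException;
    import org.apache.jackrabbit.oak.api.PropertyState;
    import org.apache.jackrabbit.oak.spi.commit.DefaultValidator;

    // Rejects commits that add a property named "secret".
    class NoSecretsValidator extends DefaultValidator {
        @Override
        public void propertyAdded(PropertyState after) throws CommitFailedException {
            if ("secret".equals(after.getName())) {
                throw new CommitFailedException("Constraint", 1, "secret properties not allowed");
            }
        }
    }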
Slide 34
Observers are related to commit hooks. In fact both share the same signature, as they get a before and an after state and can run a content diff on those to process the differences. However, observers run after the fact, that is, after the content has been persisted: observers report what has happened, while commit hooks report what might happen.
Observers might or might not see each separate revision. After big cluster merges they might receive bigger chunks spanning multiple individual commits. There might be no way to get at those individual commits, as they might already have been garbage collected on the other cluster node. The important part is that each observer sees a monotonically increasing sequence of revisions.
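A sketch of an observer that diffs each newly observed root against the previous one (the contentChanged signature follows Oak's Observer interface; the logging is illustrative):

    import org.apache.jackrabbit.oak.spi.commit.CommitInfo;
    import org.apache.jackrabbit.oak.spi.commit.Observer;
    import org.apache.jackrabbit.oak.spi.state.DefaultNodeStateDiff;
    import org.apache.jackrabbit.oak.spi.state.NodeState;

    class LoggingObserver implements Observer {
        private NodeState previous; // the last root this observer has seen

        @Override
        public synchronized void contentChanged(NodeState root, CommitInfo info) {
            if (previous != null) {
                // The diff may span several commits merged elsewhere in the cluster.
                root.compareAgainstBaseState(previous, new DefaultNodeStateDiff() {
                    @Override
                    public boolean childNodeChanged(String name, NodeState before, NodeState after) {
                        System.out.println("changed: " + name);
                        return true;
                    }
                });
            }
            previous = root;
        }
    }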
Slide 35
As JCR observation in Oak is based on observers, and those might not get to see every commit, commit boundaries are usually lost. That is, information attached to individual commits, like the user or a timestamp, is only available on a best-effort basis in Oak.
External index updates are used, e.g., for the Solr integration.
Slide 37
Although search supports the same languages as Jackrabbit (SQL and XPath), the underlying implementations differ greatly.
Oak has pluggable parsers and indexes. For each query a suitable parser is first looked up and, if available, used to parse the query into an intermediate representation. Then all registered indexes are asked to provide a cost estimate for executing the query. Finally the query is executed using the cheapest index. If there is no suitable index for a query, the synthetic "traverse" index is used. This index, while very slow, allows every query to eventually succeed.
Oak's approach to query execution is closer to the RDBMS world than Jackrabbit 2's, as it allows creating indexes to speed up queries.
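For illustration, issuing a query through the standard JCR API (plain javax.jcr; the session and the property name are assumptions):

    import javax.jcr.RepositoryException;
    import javax.jcr.Session;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryManager;
    import javax.jcr.query.QueryResult;

    static QueryResult findXEqualsY(Session session) throws RepositoryException {
        QueryManager qm = session.getWorkspace().getQueryManager();
        Query query = qm.createQuery(
                "SELECT * FROM [nt:base] WHERE [x] = 'y'", Query.JCR_SQL2);
        // Oak parses the statement, asks the indexes for cost estimates and
        // executes with the cheapest one, falling back to traversal if needed.
        return query.execute();
    }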
Slide 38
Indexes are specified in content via special index definition nodes.
• Property index for properties of a given name; can optionally be ordered (a definition sketch follows this list).
• Reference index for looking up the referrers of referenceable nodes.
• Lucene full text index for full text search. Storing it in content replicates it automatically across cluster nodes and makes it easy to back up along with the content.
• Solr can be run embedded but usually runs externally.
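A sketch of defining a property index in content (the node structure follows Oak's documented property index; the indexed property "foo" is an example, and /oak:index is assumed to exist):

    import javax.jcr.Node;
    import javax.jcr.PropertyType;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;

    static void definePropertyIndex(Session session) throws RepositoryException {
        Node index = session.getRootNode().getNode("oak:index")
                .addNode("foo", "oak:QueryIndexDefinition");
        index.setProperty("type", "property");  // handled by the property index
        index.setProperty("propertyNames", new String[] {"foo"}, PropertyType.NAME);
        index.setProperty("reindex", true);     // build the index on the next commit
        session.save();
    }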
Slide 40
• The MicroKernel implements the tree model.
• The NodeStore API exposes the tree model as immutable trees.
• Oak Core implements mutable trees on top of the NodeStore API.
• Plugins contribute much of the Oak Core functionality, like access control, validation, etc., through commit hooks and observers.
• The Oak API exposes mutable trees whose behaviour is shaped by the respective plugins configured.
• JCR bindings contribute the right shaping for JCR by provisioning Oak with the right set of plugins.
Other bindings besides JCR are possible, e.g. an HTTP binding as an alternative to the current WebDAV implementation. New applications could also use the Oak API directly and reuse only the plugins they need.

Contenu connexe

Tendances

Design and architecture of Jackrabbit
Design and architecture of JackrabbitDesign and architecture of Jackrabbit
Design and architecture of JackrabbitJukka Zitting
 
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...Shirshanka Das
 
Content Management With Apache Jackrabbit
Content Management With Apache JackrabbitContent Management With Apache Jackrabbit
Content Management With Apache JackrabbitJukka Zitting
 
Ractor's speed is not light-speed
Ractor's speed is not light-speedRactor's speed is not light-speed
Ractor's speed is not light-speedSATOSHI TAGOMORI
 
High Concurrency Architecture and Laravel Performance Tuning
High Concurrency Architecture and Laravel Performance TuningHigh Concurrency Architecture and Laravel Performance Tuning
High Concurrency Architecture and Laravel Performance TuningAlbert Chen
 
Best Practices for Running Kafka on Docker Containers
Best Practices for Running Kafka on Docker ContainersBest Practices for Running Kafka on Docker Containers
Best Practices for Running Kafka on Docker ContainersBlueData, Inc.
 
Infrastructure testing with Molecule and TestInfra
Infrastructure testing with Molecule and TestInfraInfrastructure testing with Molecule and TestInfra
Infrastructure testing with Molecule and TestInfraTomislav Plavcic
 
Empowering developers to deploy their own data stores
Empowering developers to deploy their own data storesEmpowering developers to deploy their own data stores
Empowering developers to deploy their own data storesTomas Doran
 
ContainerCon sysdig Slides
ContainerCon sysdig Slides ContainerCon sysdig Slides
ContainerCon sysdig Slides Loris Degioanni
 
Apache Jackrabbit Oak on MongoDB
Apache Jackrabbit Oak on MongoDBApache Jackrabbit Oak on MongoDB
Apache Jackrabbit Oak on MongoDBMongoDB
 
Apache Performance Tuning: Scaling Up
Apache Performance Tuning: Scaling UpApache Performance Tuning: Scaling Up
Apache Performance Tuning: Scaling UpSander Temme
 
Apache Performance Tuning: Scaling Out
Apache Performance Tuning: Scaling OutApache Performance Tuning: Scaling Out
Apache Performance Tuning: Scaling OutSander Temme
 
Apache Kafka
Apache KafkaApache Kafka
Apache KafkaJoe Stein
 
Developing high-performance network servers in Lisp
Developing high-performance network servers in LispDeveloping high-performance network servers in Lisp
Developing high-performance network servers in LispVladimir Sedach
 
The State of HBase Replication
The State of HBase ReplicationThe State of HBase Replication
The State of HBase ReplicationHBaseCon
 
Pulsarctl & Pulsar Manager
Pulsarctl & Pulsar ManagerPulsarctl & Pulsar Manager
Pulsarctl & Pulsar ManagerStreamNative
 
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...JAXLondon2014
 
Summary of JDK10 and What will come into JDK11
Summary of JDK10 and What will come into JDK11Summary of JDK10 and What will come into JDK11
Summary of JDK10 and What will come into JDK11なおき きしだ
 

Tendances (20)

Design and architecture of Jackrabbit
Design and architecture of JackrabbitDesign and architecture of Jackrabbit
Design and architecture of Jackrabbit
 
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
 
Content Management With Apache Jackrabbit
Content Management With Apache JackrabbitContent Management With Apache Jackrabbit
Content Management With Apache Jackrabbit
 
Jakarta EE 8 on JDK17
Jakarta EE 8 on JDK17Jakarta EE 8 on JDK17
Jakarta EE 8 on JDK17
 
Ractor's speed is not light-speed
Ractor's speed is not light-speedRactor's speed is not light-speed
Ractor's speed is not light-speed
 
High Concurrency Architecture and Laravel Performance Tuning
High Concurrency Architecture and Laravel Performance TuningHigh Concurrency Architecture and Laravel Performance Tuning
High Concurrency Architecture and Laravel Performance Tuning
 
Best Practices for Running Kafka on Docker Containers
Best Practices for Running Kafka on Docker ContainersBest Practices for Running Kafka on Docker Containers
Best Practices for Running Kafka on Docker Containers
 
Infrastructure testing with Molecule and TestInfra
Infrastructure testing with Molecule and TestInfraInfrastructure testing with Molecule and TestInfra
Infrastructure testing with Molecule and TestInfra
 
Highlights Of Sqoop2
Highlights Of Sqoop2Highlights Of Sqoop2
Highlights Of Sqoop2
 
Empowering developers to deploy their own data stores
Empowering developers to deploy their own data storesEmpowering developers to deploy their own data stores
Empowering developers to deploy their own data stores
 
ContainerCon sysdig Slides
ContainerCon sysdig Slides ContainerCon sysdig Slides
ContainerCon sysdig Slides
 
Apache Jackrabbit Oak on MongoDB
Apache Jackrabbit Oak on MongoDBApache Jackrabbit Oak on MongoDB
Apache Jackrabbit Oak on MongoDB
 
Apache Performance Tuning: Scaling Up
Apache Performance Tuning: Scaling UpApache Performance Tuning: Scaling Up
Apache Performance Tuning: Scaling Up
 
Apache Performance Tuning: Scaling Out
Apache Performance Tuning: Scaling OutApache Performance Tuning: Scaling Out
Apache Performance Tuning: Scaling Out
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 
Developing high-performance network servers in Lisp
Developing high-performance network servers in LispDeveloping high-performance network servers in Lisp
Developing high-performance network servers in Lisp
 
The State of HBase Replication
The State of HBase ReplicationThe State of HBase Replication
The State of HBase Replication
 
Pulsarctl & Pulsar Manager
Pulsarctl & Pulsar ManagerPulsarctl & Pulsar Manager
Pulsarctl & Pulsar Manager
 
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
Detecting Events on the Web in Real Time with Java, Kafka and ZooKeeper - Jam...
 
Summary of JDK10 and What will come into JDK11
Summary of JDK10 and What will come into JDK11Summary of JDK10 and What will come into JDK11
Summary of JDK10 and What will come into JDK11
 

En vedette

Building Content Applications with JCR and OSGi
Building Content Applications with JCR and OSGiBuilding Content Applications with JCR and OSGi
Building Content Applications with JCR and OSGiCédric Hüsler
 
Demystifying Oak Search
Demystifying Oak SearchDemystifying Oak Search
Demystifying Oak SearchJustin Edelson
 
Into the TarPit: A TarMK Deep Dive
Into the TarPit: A TarMK Deep DiveInto the TarPit: A TarMK Deep Dive
Into the TarPit: A TarMK Deep DiveMichael Dürig
 
Build single page applications using AngularJS on AEM
Build single page applications using AngularJS on AEMBuild single page applications using AngularJS on AEM
Build single page applications using AngularJS on AEMconnectwebex
 
JCR, Sling or AEM? Which API should I use and when?
JCR, Sling or AEM? Which API should I use and when?JCR, Sling or AEM? Which API should I use and when?
JCR, Sling or AEM? Which API should I use and when?connectwebex
 
Introduction to Sightly and Sling Models
Introduction to Sightly and Sling ModelsIntroduction to Sightly and Sling Models
Introduction to Sightly and Sling ModelsStefano Celentano
 
Oak, the Architecture of the new Repository
Oak, the Architecture of the new RepositoryOak, the Architecture of the new Repository
Oak, the Architecture of the new RepositoryMichael Dürig
 
Multi site manager
Multi site managerMulti site manager
Multi site managershivani garg
 
Adobe Meetup AEM Architecture Sydney 2015
Adobe Meetup AEM Architecture Sydney 2015Adobe Meetup AEM Architecture Sydney 2015
Adobe Meetup AEM Architecture Sydney 2015Michael Henderson
 
Microservices Architecture for AEM
Microservices Architecture for AEMMicroservices Architecture for AEM
Microservices Architecture for AEMMaciej Majchrzak
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)dnaber
 
New Repository in AEM 6 by Michael Marth
New Repository in AEM 6 by Michael MarthNew Repository in AEM 6 by Michael Marth
New Repository in AEM 6 by Michael MarthAEM HUB
 
Mime Magic With Apache Tika
Mime Magic With Apache TikaMime Magic With Apache Tika
Mime Magic With Apache TikaJukka Zitting
 
Open source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache IncubatorOpen source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache IncubatorJukka Zitting
 
The new repository in AEM 6
The new repository in AEM 6The new repository in AEM 6
The new repository in AEM 6Jukka Zitting
 
Apache Jackrabbit Oak - Scale your content repository to the cloud
Apache Jackrabbit Oak - Scale your content repository to the cloudApache Jackrabbit Oak - Scale your content repository to the cloud
Apache Jackrabbit Oak - Scale your content repository to the cloudRobert Munteanu
 

En vedette (16)

Building Content Applications with JCR and OSGi
Building Content Applications with JCR and OSGiBuilding Content Applications with JCR and OSGi
Building Content Applications with JCR and OSGi
 
Demystifying Oak Search
Demystifying Oak SearchDemystifying Oak Search
Demystifying Oak Search
 
Into the TarPit: A TarMK Deep Dive
Into the TarPit: A TarMK Deep DiveInto the TarPit: A TarMK Deep Dive
Into the TarPit: A TarMK Deep Dive
 
Build single page applications using AngularJS on AEM
Build single page applications using AngularJS on AEMBuild single page applications using AngularJS on AEM
Build single page applications using AngularJS on AEM
 
JCR, Sling or AEM? Which API should I use and when?
JCR, Sling or AEM? Which API should I use and when?JCR, Sling or AEM? Which API should I use and when?
JCR, Sling or AEM? Which API should I use and when?
 
Introduction to Sightly and Sling Models
Introduction to Sightly and Sling ModelsIntroduction to Sightly and Sling Models
Introduction to Sightly and Sling Models
 
Oak, the Architecture of the new Repository
Oak, the Architecture of the new RepositoryOak, the Architecture of the new Repository
Oak, the Architecture of the new Repository
 
Multi site manager
Multi site managerMulti site manager
Multi site manager
 
Adobe Meetup AEM Architecture Sydney 2015
Adobe Meetup AEM Architecture Sydney 2015Adobe Meetup AEM Architecture Sydney 2015
Adobe Meetup AEM Architecture Sydney 2015
 
Microservices Architecture for AEM
Microservices Architecture for AEMMicroservices Architecture for AEM
Microservices Architecture for AEM
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
New Repository in AEM 6 by Michael Marth
New Repository in AEM 6 by Michael MarthNew Repository in AEM 6 by Michael Marth
New Repository in AEM 6 by Michael Marth
 
Mime Magic With Apache Tika
Mime Magic With Apache TikaMime Magic With Apache Tika
Mime Magic With Apache Tika
 
Open source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache IncubatorOpen source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache Incubator
 
The new repository in AEM 6
The new repository in AEM 6The new repository in AEM 6
The new repository in AEM 6
 
Apache Jackrabbit Oak - Scale your content repository to the cloud
Apache Jackrabbit Oak - Scale your content repository to the cloudApache Jackrabbit Oak - Scale your content repository to the cloud
Apache Jackrabbit Oak - Scale your content repository to the cloud
 

Similaire à The architecture of oak

Ippevent : openshift Introduction
Ippevent : openshift IntroductionIppevent : openshift Introduction
Ippevent : openshift Introductionkanedafromparis
 
The Source Control Landscape
The Source Control LandscapeThe Source Control Landscape
The Source Control LandscapeLorna Mitchell
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Junping Du
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateDataWorks Summit
 
Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2hdhappy001
 
Scalable Persistent Storage for Erlang: Theory and Practice
Scalable Persistent Storage for Erlang: Theory and PracticeScalable Persistent Storage for Erlang: Theory and Practice
Scalable Persistent Storage for Erlang: Theory and PracticeAmir Ghaffari
 
An Updated Performance Comparison of Virtual Machines and Linux Containers
An Updated Performance Comparison of Virtual Machines and Linux ContainersAn Updated Performance Comparison of Virtual Machines and Linux Containers
An Updated Performance Comparison of Virtual Machines and Linux ContainersKento Aoyama
 
Introduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OSIntroduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OSSteve Wong
 
OpenEBS; asymmetrical block layer in user-space breaking the million IOPS bar...
OpenEBS; asymmetrical block layer in user-space breaking the million IOPS bar...OpenEBS; asymmetrical block layer in user-space breaking the million IOPS bar...
OpenEBS; asymmetrical block layer in user-space breaking the million IOPS bar...MayaData
 
Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on DockerDataWorks Summit
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Odinot Stanislas
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Chris Nauroth
 
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...HostedbyConfluent
 
State of the Container Ecosystem
State of the Container EcosystemState of the Container Ecosystem
State of the Container EcosystemVinay Rao
 
Building a Distributed File System for the Cloud-Native Era
Building a Distributed File System for the Cloud-Native EraBuilding a Distributed File System for the Cloud-Native Era
Building a Distributed File System for the Cloud-Native EraAlluxio, Inc.
 
Ceph Day Santa Clara: Keynote: Building Tomorrow's Ceph
Ceph Day Santa Clara: Keynote: Building Tomorrow's Ceph Ceph Day Santa Clara: Keynote: Building Tomorrow's Ceph
Ceph Day Santa Clara: Keynote: Building Tomorrow's Ceph Ceph Community
 
Ceph Day NYC: Building Tomorrow's Ceph
Ceph Day NYC: Building Tomorrow's CephCeph Day NYC: Building Tomorrow's Ceph
Ceph Day NYC: Building Tomorrow's CephCeph Community
 

Similaire à The architecture of oak (20)

Ippevent : openshift Introduction
Ippevent : openshift IntroductionIppevent : openshift Introduction
Ippevent : openshift Introduction
 
The Source Control Landscape
The Source Control LandscapeThe Source Control Landscape
The Source Control Landscape
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2
 
Solr 4
Solr 4Solr 4
Solr 4
 
Scalable Persistent Storage for Erlang: Theory and Practice
Scalable Persistent Storage for Erlang: Theory and PracticeScalable Persistent Storage for Erlang: Theory and Practice
Scalable Persistent Storage for Erlang: Theory and Practice
 
An Updated Performance Comparison of Virtual Machines and Linux Containers
An Updated Performance Comparison of Virtual Machines and Linux ContainersAn Updated Performance Comparison of Virtual Machines and Linux Containers
An Updated Performance Comparison of Virtual Machines and Linux Containers
 
Introduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OSIntroduction to Apache Mesos and DC/OS
Introduction to Apache Mesos and DC/OS
 
OpenEBS; asymmetrical block layer in user-space breaking the million IOPS bar...
OpenEBS; asymmetrical block layer in user-space breaking the million IOPS bar...OpenEBS; asymmetrical block layer in user-space breaking the million IOPS bar...
OpenEBS; asymmetrical block layer in user-space breaking the million IOPS bar...
 
Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on Docker
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
 
Ceph
CephCeph
Ceph
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4
 
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
 
LOD2 Plenary Vienna 2012: WP2 - Storing and Querying Very Large Knowledge Bases
LOD2 Plenary Vienna 2012: WP2 - Storing and Querying Very Large Knowledge BasesLOD2 Plenary Vienna 2012: WP2 - Storing and Querying Very Large Knowledge Bases
LOD2 Plenary Vienna 2012: WP2 - Storing and Querying Very Large Knowledge Bases
 
State of the Container Ecosystem
State of the Container EcosystemState of the Container Ecosystem
State of the Container Ecosystem
 
Building a Distributed File System for the Cloud-Native Era
Building a Distributed File System for the Cloud-Native EraBuilding a Distributed File System for the Cloud-Native Era
Building a Distributed File System for the Cloud-Native Era
 
Ceph Day Santa Clara: Keynote: Building Tomorrow's Ceph
Ceph Day Santa Clara: Keynote: Building Tomorrow's Ceph Ceph Day Santa Clara: Keynote: Building Tomorrow's Ceph
Ceph Day Santa Clara: Keynote: Building Tomorrow's Ceph
 
Ceph Day NYC: Building Tomorrow's Ceph
Ceph Day NYC: Building Tomorrow's CephCeph Day NYC: Building Tomorrow's Ceph
Ceph Day NYC: Building Tomorrow's Ceph
 

Dernier

Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine HarmonyLeveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmonyelliciumsolutionspun
 
Webinar - IA generativa e grafi Neo4j: RAG time!
Webinar - IA generativa e grafi Neo4j: RAG time!Webinar - IA generativa e grafi Neo4j: RAG time!
Webinar - IA generativa e grafi Neo4j: RAG time!Neo4j
 
AI Embracing Every Shade of Human Beauty
AI Embracing Every Shade of Human BeautyAI Embracing Every Shade of Human Beauty
AI Embracing Every Shade of Human BeautyRaymond Okyere-Forson
 
JS-Experts - Cybersecurity for Generative AI
JS-Experts - Cybersecurity for Generative AIJS-Experts - Cybersecurity for Generative AI
JS-Experts - Cybersecurity for Generative AIIvo Andreev
 
IA Generativa y Grafos de Neo4j: RAG time
IA Generativa y Grafos de Neo4j: RAG timeIA Generativa y Grafos de Neo4j: RAG time
IA Generativa y Grafos de Neo4j: RAG timeNeo4j
 
How Does the Epitome of Spyware Differ from Other Malicious Software?
How Does the Epitome of Spyware Differ from Other Malicious Software?How Does the Epitome of Spyware Differ from Other Malicious Software?
How Does the Epitome of Spyware Differ from Other Malicious Software?AmeliaSmith90
 
Mastering Kubernetes - Basics and Advanced Concepts using Example Project
Mastering Kubernetes - Basics and Advanced Concepts using Example ProjectMastering Kubernetes - Basics and Advanced Concepts using Example Project
Mastering Kubernetes - Basics and Advanced Concepts using Example Projectwajrcs
 
About .NET 8 and a first glimpse into .NET9
About .NET 8 and a first glimpse into .NET9About .NET 8 and a first glimpse into .NET9
About .NET 8 and a first glimpse into .NET9Jürgen Gutsch
 
ERP For Electrical and Electronics manufecturing.pptx
ERP For Electrical and Electronics manufecturing.pptxERP For Electrical and Electronics manufecturing.pptx
ERP For Electrical and Electronics manufecturing.pptxAutus Cyber Tech
 
Kubernetes go-live checklist for your microservices.pptx
Kubernetes go-live checklist for your microservices.pptxKubernetes go-live checklist for your microservices.pptx
Kubernetes go-live checklist for your microservices.pptxPrakarsh -
 
20240330_고급진 코드를 위한 exception 다루기
20240330_고급진 코드를 위한 exception 다루기20240330_고급진 코드를 위한 exception 다루기
20240330_고급진 코드를 위한 exception 다루기Chiwon Song
 
Watermarking in Source Code: Applications and Security Challenges
Watermarking in Source Code: Applications and Security ChallengesWatermarking in Source Code: Applications and Security Challenges
Watermarking in Source Code: Applications and Security ChallengesShyamsundar Das
 
Kawika Technologies pvt ltd Software Development Company in Trivandrum
Kawika Technologies pvt ltd Software Development Company in TrivandrumKawika Technologies pvt ltd Software Development Company in Trivandrum
Kawika Technologies pvt ltd Software Development Company in TrivandrumKawika Technologies
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLAlluxio, Inc.
 
Top Software Development Trends in 2024
Top Software Development Trends in  2024Top Software Development Trends in  2024
Top Software Development Trends in 2024Mind IT Systems
 
OpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS CalculatorOpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS CalculatorShane Coughlan
 
eAuditor Audits & Inspections - conduct field inspections
eAuditor Audits & Inspections - conduct field inspectionseAuditor Audits & Inspections - conduct field inspections
eAuditor Audits & Inspections - conduct field inspectionsNirav Modi
 
online pdf editor software solutions.pdf
online pdf editor software solutions.pdfonline pdf editor software solutions.pdf
online pdf editor software solutions.pdfMeon Technology
 

Dernier (20)

Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine HarmonyLeveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
Leveraging DxSherpa's Generative AI Services to Unlock Human-Machine Harmony
 
Webinar - IA generativa e grafi Neo4j: RAG time!
Webinar - IA generativa e grafi Neo4j: RAG time!Webinar - IA generativa e grafi Neo4j: RAG time!
Webinar - IA generativa e grafi Neo4j: RAG time!
 
AI Embracing Every Shade of Human Beauty
AI Embracing Every Shade of Human BeautyAI Embracing Every Shade of Human Beauty
AI Embracing Every Shade of Human Beauty
 
JS-Experts - Cybersecurity for Generative AI
JS-Experts - Cybersecurity for Generative AIJS-Experts - Cybersecurity for Generative AI
JS-Experts - Cybersecurity for Generative AI
 
IA Generativa y Grafos de Neo4j: RAG time
IA Generativa y Grafos de Neo4j: RAG timeIA Generativa y Grafos de Neo4j: RAG time
IA Generativa y Grafos de Neo4j: RAG time
 
How Does the Epitome of Spyware Differ from Other Malicious Software?
How Does the Epitome of Spyware Differ from Other Malicious Software?How Does the Epitome of Spyware Differ from Other Malicious Software?
How Does the Epitome of Spyware Differ from Other Malicious Software?
 
Program with GUTs
Program with GUTsProgram with GUTs
Program with GUTs
 
Mastering Kubernetes - Basics and Advanced Concepts using Example Project
Mastering Kubernetes - Basics and Advanced Concepts using Example ProjectMastering Kubernetes - Basics and Advanced Concepts using Example Project
Mastering Kubernetes - Basics and Advanced Concepts using Example Project
 
About .NET 8 and a first glimpse into .NET9
About .NET 8 and a first glimpse into .NET9About .NET 8 and a first glimpse into .NET9
About .NET 8 and a first glimpse into .NET9
 
ERP For Electrical and Electronics manufecturing.pptx
ERP For Electrical and Electronics manufecturing.pptxERP For Electrical and Electronics manufecturing.pptx
ERP For Electrical and Electronics manufecturing.pptx
 
Kubernetes go-live checklist for your microservices.pptx
Kubernetes go-live checklist for your microservices.pptxKubernetes go-live checklist for your microservices.pptx
Kubernetes go-live checklist for your microservices.pptx
 
20240330_고급진 코드를 위한 exception 다루기
20240330_고급진 코드를 위한 exception 다루기20240330_고급진 코드를 위한 exception 다루기
20240330_고급진 코드를 위한 exception 다루기
 
Sustainable Web Design - Claire Thornewill
Sustainable Web Design - Claire ThornewillSustainable Web Design - Claire Thornewill
Sustainable Web Design - Claire Thornewill
 
Watermarking in Source Code: Applications and Security Challenges
Watermarking in Source Code: Applications and Security ChallengesWatermarking in Source Code: Applications and Security Challenges
Watermarking in Source Code: Applications and Security Challenges
 
Kawika Technologies pvt ltd Software Development Company in Trivandrum
Kawika Technologies pvt ltd Software Development Company in TrivandrumKawika Technologies pvt ltd Software Development Company in Trivandrum
Kawika Technologies pvt ltd Software Development Company in Trivandrum
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
 
Top Software Development Trends in 2024
Top Software Development Trends in  2024Top Software Development Trends in  2024
Top Software Development Trends in 2024
 
OpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS CalculatorOpenChain Webinar: Universal CVSS Calculator
OpenChain Webinar: Universal CVSS Calculator
 
eAuditor Audits & Inspections - conduct field inspections
eAuditor Audits & Inspections - conduct field inspectionseAuditor Audits & Inspections - conduct field inspections
eAuditor Audits & Inspections - conduct field inspections
 
online pdf editor software solutions.pdf
online pdf editor software solutions.pdfonline pdf editor software solutions.pdf
online pdf editor software solutions.pdf
 

The architecture of oak

  • 1. Oak, the Architecture of the new Repository Michael Dürig, Adobe Research Switzerland
  • 2. Design goals • Scalable • Big repositories • Clustering • Customisable, flexible • OSGi friendly 11/17/2014 2
  • 3. Outline • CRUD • Changes • Search • Big Picture 11/17/2014 3
  • 4. Tree model a d b c 11/17/2014 4
  • 5. Updating ? x a d b c 11/17/2014 5
  • 6. MVCC HEAD r1: / r2: / a d b c r1: /d r2: /d r1: /a/b r2: /a/b r2: /d/x 11/17/2014 6
  • 7. Refresh and Garbage Collection
  • 11. Concurrent updates r2a r1 r2b 11/17/2014 11
  • 12. Merging merge r2a updates r1 r3 r2b 11/17/2014 12
  • 13. Conflict handing: serialisation • Fully serialised – Fail, no concurrent update • Partially serialised – Concurrent conflict free updates 11/17/2014 13
  • 14. Conflict handling strategies: merging • Partial merge – Conflict markers, deferred resolution • Full merge – Need to choose victim 11/17/2014 14
  • 16. Replica and caches master copy full replica cache 11/17/2014 16
  • 17. Sharding strategies by path by level by hash with caching 11/17/2014 17
  • 19. MicroKernel / NodeStore • Tree / Revision model implementation Responsible for Clustering Sharding Caching Conflict handling Not responsible for Validation Access control Search Versioning 11/17/2014 19
  • 20. Current implementations DocumentMK TarMK (SegmentMK) Persistence MongoDB, JDBC Local FS Conflict handling Partial serialisation Full serialisation Clustering MongoDB clustering Simple failover Sharding MongoDB sharding N/A Node Performance Moderate High Key use cases Large deployments (>1TB), concurrent writes Small/medium deployments, mostly read 11/17/2014 20
  • 22. Accessible paths a d b c 11/17/2014 22
  • 23. xistentialism • All paths traversable – Node may not exist – Decorator on NodeStore root.getChildNode("a").exists(); root.getChildNode("a") .getChildNode("b").exists(); ⟹ false ⟹ true 11/17/2014 23
  • 25. Content diff • What changed between trees • Cornerstone for – Validation – Indexing – Observation – … 11/17/2014 25
  • 26. What changed? Δ 11/17/2014 26
  • 27. Example: merging r1 r2a r2b r3 Δ r1 ➞ r2a “a” modified “b” removed Δ r1 ➞ r2b “d” modified “x” added 11/17/2014 27
  • 29. Commit hooks • Key plugin mechanism – Higher level functionality • Validation (node type, access control, …) • Trigger (auto create, defaults, …) • Updates (index, …) 11/17/2014 29
  • 30. Editing a commit Δ Δ + x 11/17/2014 30
  • 31. Commit hooks • Based on content diff – pass a commit – fail a commit – edit a commit • Applied in sequence 11/17/2014 31
  • 32. Type of hooks CommitHook Editor Validator Content diff Optional Always Always Can modify Yes Yes No Programming model Simple Callbacks Callbacks Performance impact High Medium Low 11/17/2014 32
  • 34. Observers • Observe changes – After commit – Often does a content diff – Asynchronous – Optionally synchronous • Local cluster node only 11/17/2014 34
  • 35. Examples • JCR observation • External index update • Cache invalidation • Logging 11/17/2014 35
  • 37. Query Engine SELECT WHERE x=y /a//* Parse r parse execute post process Parse r Parser Parser Index Index Index Traverse 11/17/2014 37
  • 38. Index Implementations • Property (ordered) • Reference • Lucene – In-content or file system • Solr – Embedded or external 11/17/2014 38
  • 40. Big picture JCR API Oak JCR Oak API Oak Core NodeStore API MicroKernel Plugins 11/17/2014 40
  • 43. Resources http://jackrabbit.apache.org/oak/ http://jackrabbit.apache.org/oak/docs/ https://svn.apache.org/repos/asf/jackrabbit/o ak/trunk/ 11/17/2014 43
  • 45. Slide 1 This presentation is mainly about Oak’s architecture and design. Understanding these concepts gives crucial insight in how to make the most out of Oak and to why Oak might behave differently than Jackrabbit 2 in some cases. 11/17/2014 45
  • 46. Slide 2 Jackrabbit Oak started early 2012 with some initial ideas dating back as far as 2008. It became necessary as many parts of Jackrabbit 2 outgrew their original design. Most of Jackrabbit 2’s features date back to the 90-ies and are not well suited for today's requirements. Oak was designed to overcome those challenges and to serve as the foundation of modern web content management systems. Key design goals: • scalable writes. The web is not read only any more. • large amounts of data. There is much more as a few web pages nowadays. • Built in clustering. Instead of built on top • Customisable • OSGi friendly Since Oak doesn't need to be the JCR reference implementation, we gained some additional design space by not having to implement all of the optional features (like e.g. same name siblings and support for multiple work spaces). 11/17/2014 46
  • 47. Slide 3 • CRUD: this presentation first covers the underlying persistence model: the tree model and basic create, read, update and delete operations. • Changes: being able to track changes between different revisions of a tree turns out to be crucial for building higher level functionality. • Search: while nothing much changed on the outside, search is completely different in Oak wrt. Jackrabbit 2. 11/17/2014 47
  • 48. Slide 4 Let’s consider a simple hierarchy of nodes. Each node (except the root) has a single parent and any number of child nodes. The parent-child relationships are named, i.e. each child has a unique name within its parent. This makes it possible to uniquely identify any node using its path: a user can access all content by path starting from the root node. This is a key different to Jackrabbit 2 where each node was assigned an unique id to look it up from the persistence store. In Oak nodes are always addressed its path from the root. In this sense Oak stores (sub) trees while Jackrabbit 2 stores key value pairs. In Oak one traverses down from the root following a path while in Jackrabbit 2 traversal was from a node to its parent up to the root. • Tree persistence vs. key/value persistence • Path vs. UID as primary identifier • Traversing down vs. traversing up 11/17/2014 48
  • 49. Slide 5 Let’s consider what happens when another user updates parts of the tree, for example by adding a new node at /d/x. Such in-place changes might confuse other users, whose tree suddenly changes underneath them. This is how Jackrabbit 2 works: each update is immediately made visible to all users. Unfortunately, beyond the potential for confusion, this design turns out to be a major concurrency bottleneck, as the synchronisation overhead of keeping everyone aware of all changes as they happen becomes very high. The existing Jackrabbit architecture was heavily optimised for mostly-read use cases, with only occasional and rarely concurrent content updates. That optimisation no longer works well with increasingly interactive web sites and other content applications where all users are potential content editors. More generally, the way such state transitions are handled has a major impact on how efficiently a system can scale up to handle many concurrent updates. Many NoSQL systems use the concept of eventual consistency, which leaves the rate (and often the order) at which new updates become visible to users undefined. This solves the concurrency issue, but can lead to even more confusion as it might not be possible to clearly define the exact state of the repository. The hierarchical structure of Oak allows us to solve both of these issues by borrowing an idea from version control systems like Git or Subversion. 11/17/2014 49
  • 50. Slide 6 Instead of overwriting the existing content, a new revision of content is created, with new copies of the parent nodes all the way up to the root if needed. This allows all users to keep accessing their revision of content regardless of what changes are being made elsewhere. All users are under the impression that they operate on a private copy of the whole repository (MVCC). To make this work, each revision is assigned a unique revision identifier, and all paths and content accessed are evaluated in the context of a specific revision. For example, the original version of the /d node would be found by following that path within revision r1, and the new version of the node by following the path in revision r2. The unchanged node /a/c is reachable and identical through both revisions r1 and r2. Additionally the repository keeps track of the HEAD revision, which records the latest state of the repository. A new user that has not already accessed the repository would start with the HEAD revision. The ordered list of the current HEAD revision together with all former HEAD revisions forms a journal, which represents the linear sequence of changes the repository went through until eventually reaching the current HEAD revision. • The tree model is key to understanding how Oak works • Ideas are borrowed from VCSs like Subversion and Git: a tree with immutable update goodies 11/17/2014 50
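A sketch of this copy-on-write update, assuming the NodeBuilder/NodeState interfaces from the Oak SPI; the paths mirror the /d/x example above:

    import org.apache.jackrabbit.oak.spi.state.NodeBuilder;
    import org.apache.jackrabbit.oak.spi.state.NodeState;
    import org.apache.jackrabbit.oak.spi.state.NodeStore;

    class AddDX {
        static NodeState addDX(NodeStore store) {
            NodeState r1 = store.getRoot();       // revision r1
            NodeBuilder builder = r1.builder();   // scratch space on top of r1
            builder.child("d").child("x");        // add /d/x
            return builder.getNodeState();        // r2: fresh copies of /d and
                                                  // the root; r1 stays untouched
        }
    }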
  • 51. Slide 8 Let’s consider what happens when a user on an old revision wants to move forward to the HEAD revision. Unlike in classic Jackrabbit, where this would always happen automatically, in Oak this state transition is handled explicitly as a refresh operation. Refresh in Oak is explicit where it was implicit in Jackrabbit 2. For backward compatibility Oak provides some "auto refresh" logic; see the Oak documentation for further details. 11/17/2014 51
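A hedged illustration, in plain JCR terms, of what an explicit refresh looks like to a client:

    import javax.jcr.RepositoryException;
    import javax.jcr.Session;

    class Refresh {
        // Bring a long-lived session forward to HEAD; in Oak this is an
        // explicit step, while Jackrabbit 2 did it implicitly.
        static void catchUp(Session session) throws RepositoryException {
            session.refresh(true);    // rebase pending changes onto HEAD
            // session.refresh(false) would discard pending changes instead
        }
    }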
  • 52. Slide 9 After the refresh, the older versions of specific nodes may no longer be accessible by any client and thus become garbage, which the system will eventually collect. This is a key difference from a version control system, and it allows Oak repositories to work even with workloads of thousands of updates per second without running out of space. 11/17/2014 52
  • 53. Slide 11 What about concurrent updates? Here we have two competing revisions, r2a and r2b, both modifying the same base revision r1. The r2a revision removes the node at /a/b, and r2b is the revision we saw earlier, which adds a new node at /d/x. 11/17/2014 53
  • 54. Slide 12 We start with the base revision r1. Then the two new revisions are created concurrently. Finally, the system automatically merges the results, creating the new revision r3, which contains the merged changes from r2a and r2b. The merge could happen on different cluster nodes, in different data centres, or even between disconnected local copies (much like a git clone). 11/17/2014 54
  • 55. Slide 13 What happens when merging is not possible because two concurrent revisions conflict? There are several strategies for handling this case: • Full serialisation: don't allow concurrent updates at all and fail. This heavily impacts the write rate in the face of many concurrent writers. • Partial serialisation: instead of locking the whole tree, lock only the root of the modified sub tree. This allows concurrent updates to separate sub trees. Both strategies are currently implemented by Oak, depending on the persistence back-end in use; a common client-side reaction to a failed commit is sketched below. 11/17/2014 55
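A common client-side reaction to a commit that lost such a race, sketched with the standard JCR API (the retry policy itself is made up for illustration):

    import javax.jcr.InvalidItemStateException;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;

    class SaveWithRetry {
        // Retry the whole change when the commit loses a race against a
        // conflicting revision.
        static void save(Session session, Runnable changes) throws RepositoryException {
            for (int attempt = 0; attempt < 3; attempt++) {
                try {
                    changes.run();            // (re)apply the transient changes
                    session.save();
                    return;
                } catch (InvalidItemStateException conflict) {
                    session.refresh(false);   // drop the losing changes and retry
                }
            }
            throw new RepositoryException("giving up after repeated conflicts");
        }
    }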
  • 56. Slide 14 Instead of pure serialisation, a more sophisticated approach would be to semantically merge conflicting changes. Generally this needs domain knowledge and cannot be fully implemented in the persistence layer. • Partial merging: persist conflict markers along with the conflicting changes for deferred resolution by e.g. an administrator. • Full merging: this inevitably leads to data loss, as one of the changes has to be preferred over the other, unless a total order can be established over all revisions, which is usually not the case. Oak currently does not implement these strategies, though they could be plugged in. 11/17/2014 56
  • 57. Slide 16 Replicas are straightforward: each replica only needs to follow the primary. Thanks to the immutable, append-only model of persistence there is no need for invalidation, and replicas can run their own garbage collection cycles. As a variant, a replica can also serve as a cache by only following frequently accessed parts of the tree and ignoring updates to other parts. 11/17/2014 57
  • 58. Slide 17 Sharding by path is the most straightforward variant. However, as sub tree sizes tend to be unevenly distributed, shards end up varying greatly in size. Sharding by level turns out to be even more problematic, as there is more content the deeper one descends into the tree, while the root level contains only a single node. Although there are more revisions of the root (as every change creates a new root node), garbage collection keeps that number within reasonable limits. Sharding by content hash is what the DocumentMK does, and it has turned out to be the most effective. A drawback of this approach is the loss of locality: the nodes of a path tend to be spread across various shards, forcing navigational access to touch multiple shards (see the sketch below). An idea for addressing this is to allow each shard to cache a node's parent nodes. Since the lower levels contain most of the content, caching the parents is cheap while at the same time restoring locality. 11/17/2014 58
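Purely illustrative arithmetic (Oak itself delegates sharding to the back-end, e.g. MongoDB) showing why hashing destroys locality:

    class Shards {
        // Hash-based placement: even distribution, but /a, /a/b and /a/b/c
        // hash independently, so one path traversal may touch several
        // shards unless each shard also caches the parents of its nodes.
        static int shardOf(String path, int shardCount) {
            return Math.floorMod(path.hashCode(), shardCount);
        }
    }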
  • 59. Slide 19 Unfortunately there is a bit of naming confusion in Oak, as the terms MicroKernel and NodeStore can both refer to APIs and to architectural components and are used somewhat interchangeably. One way to think of it: the MicroKernel is an implementation of the tree model, while the NodeStore is a Java API for accessing the nodes of that tree model. A MicroKernel implementation is responsible for all the immutable update mechanics of the content tree along with conflict handling, garbage collection, clustering and sharding. It doesn't have any higher level functionality like validation, complex data types, access control, search or versioning. All of the latter are implemented on top of the NodeStore API. 11/17/2014 59
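A sketch of a write going through the NodeStore API, using the no-op hook and empty commit info provided by the Oak SPI:

    import org.apache.jackrabbit.oak.api.CommitFailedException;
    import org.apache.jackrabbit.oak.spi.commit.CommitInfo;
    import org.apache.jackrabbit.oak.spi.commit.EmptyHook;
    import org.apache.jackrabbit.oak.spi.state.NodeBuilder;
    import org.apache.jackrabbit.oak.spi.state.NodeStore;

    class NodeStoreWrite {
        static void addNode(NodeStore store) throws CommitFailedException {
            NodeBuilder builder = store.getRoot().builder();
            builder.child("d").child("x");   // stage /d/x against the current root
            store.merge(builder, EmptyHook.INSTANCE, CommitInfo.EMPTY);
        }
    }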
  • 60. Slide 20 Oak comes with essentially two MicroKernel implementations: the DocumentMK (formerly MongoMK) and the TarMK (aka SegmentMK). The DocumentMK started out being backed by MongoDB but is now more flexible regarding the choice of back-end; a JDBC back-end is currently being worked on. • The DocumentMK leverages the clustering and sharding capabilities of the underlying back-ends and implements partial serialisation. Due to the extra network layer involved, it has moderate single node performance, but it scales with additional cluster nodes. • The TarMK uses local tar files for storage. It is currently not cluster aware and only offers a simple failover mechanism, but it delivers maximal single node performance. The two implementations cover different use cases: where the DocumentMK is primarily for large deployments with concurrent writes, the TarMK is better suited to small to medium deployments that are mostly read. We expect the gap between the two implementations to decrease in the future, as there are ideas for making the TarMK more cluster ready while the DocumentMK still has room for improved performance. Finally, other implementations (e.g. Hadoop based) are quite possible, with the JDBC back-end for the DocumentMK most likely being the first one. 11/17/2014 60
  • 61. Slide 22 Access control is probably the trickiest feature of JCR to implement. It is also a very prominent feature and might well be considered "the killer feature" for many content-centric applications, as implementing fine grained access control on top of applications is difficult and error prone. Access control in JCR allows a node to be accessible while its parent isn't. This is contrary to the tree model, where all access is by path, traversing down from the root of the tree. We solved this in Oak by introducing the concept of "existence" for a node. Nodes that are not accessible are still traversable but do not exist. That is, the result of asking for a non-accessible child node is indistinguishable from a "really" non-existing child: both do not exist. With this, all syntactically correct paths eventually resolve to a node, which however might not exist. In this example all paths are traversable, but as only /, /d, and /a/b are accessible, only those nodes exist. The nodes at /a and /a/c do not exist. 11/17/2014 61
  • 62. Slide 23 This approach leads to a clean architecture where API clients needn't care about null values or exceptions: just traverse the path starting from the root and check for existence at the end. The existence concept is implemented as a decorator of the tree model on top of the MicroKernel, by a thin wrapper around the relevant parts of the NodeStore API. 11/17/2014 62
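A hypothetical sketch of that decorator idea; the class and policy names are invented for illustration and do not match Oak's actual implementation:

    import org.apache.jackrabbit.oak.spi.state.NodeState;

    interface AccessPolicy {                       // hypothetical permission check
        boolean canRead(NodeState state);
    }

    class SecureNode {
        private final NodeState state;             // the raw tree node
        private final AccessPolicy policy;

        SecureNode(NodeState state, AccessPolicy policy) {
            this.state = state;
            this.policy = policy;
        }

        SecureNode getChildNode(String name) {     // always traversable
            return new SecureNode(state.getChildNode(name), policy);
        }

        boolean exists() {                         // hidden == non-existing
            return state.exists() && policy.canRead(state);
        }
    }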
  • 63. Slide 25 The content diff is key to most of Oak's higher level functionality. It allows processing just the parts of the content tree that changed (during or after a commit) instead of going through the whole repository. It is used e.g. for: • Validation (node types, referential integrity, item names) • Observation: note that commit boundaries get lost when the content diff is done across more than a single revision. This is an important limitation compared to Jackrabbit 2, as commit related information such as the user or the timestamp might not always be available in Oak. • Indexing 11/17/2014 63
  • 64. Slide 26 Finding out what changed between revisions is crucial for much of Oak's higher level functionality. Changes are extracted by comparing two trees. This is similar to a diff in a version control system, but applied to content trees. A content diff runs from the root of two trees down to the leaves. A node is considered changed when it has added/removed/changed properties or child nodes. Each changed node recursively runs a content diff on each of its changed child nodes. Comparing trees this way is generic, as it works on any (sub) trees, not only from the root. It is, however, heavily optimised for the case where an earlier revision is compared against a later one, as this is the most frequently encountered case. In the depicted case the sub tree at /a doesn't need deep comparison as it is shared between both revisions: the diff process can stop right there. 11/17/2014 64
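A sketch of consuming such a diff via Oak's NodeStateDiff callback (here through the DefaultNodeStateDiff convenience base class; treat the details as indicative):

    import org.apache.jackrabbit.oak.spi.state.DefaultNodeStateDiff;
    import org.apache.jackrabbit.oak.spi.state.NodeState;

    class PrintDiff {
        static void print(NodeState before, NodeState after) {
            after.compareAgainstBaseState(before, new DefaultNodeStateDiff() {
                @Override
                public boolean childNodeAdded(String name, NodeState after) {
                    System.out.println("added " + name);
                    return true;                          // continue the diff
                }
                @Override
                public boolean childNodeDeleted(String name, NodeState before) {
                    System.out.println("removed " + name);
                    return true;
                }
                @Override
                public boolean childNodeChanged(String name, NodeState before,
                        NodeState after) {
                    return after.compareAgainstBaseState(before, this); // recurse
                }
            });
        }
    }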
  • 65. Slide 27 Merging concurrent changes relies on content diffs: first the diff between r1 and r2a is applied to r1, followed by the diff between r1 and r2b, which finally results in the new revision r3. 11/17/2014 65
  • 66. Slide 29 Commit hooks and their variants provide the key plugin mechanism of Oak. Much of Oak's functionality is implemented this way in one form or another: • Validation of node types, access control, protected or invisible content • Triggers for auto-created or default values • Updates for indexes or caches 11/17/2014 66
  • 67. Slide 30 Commit hooks rely on content diffs and allow for validation or editing of a commit before actually persisting it. 11/17/2014 67
  • 68. Slide 31 A commit hook can either • pass a commit on to be persisted unchanged, • fail a commit so it won't get persisted, or • edit a commit so that a changed tree is persisted. Commit hooks are applied in sequence and tend to be expensive, as each of them will most likely run a separate content diff. 11/17/2014 68
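A sketch of the CommitHook contract; the interface is Oak's, while the /tmp rule and the exception arguments are made up for illustration:

    import org.apache.jackrabbit.oak.api.CommitFailedException;
    import org.apache.jackrabbit.oak.spi.commit.CommitHook;
    import org.apache.jackrabbit.oak.spi.commit.CommitInfo;
    import org.apache.jackrabbit.oak.spi.state.NodeState;

    class RejectTmpNodes implements CommitHook {
        @Override
        public NodeState processCommit(NodeState before, NodeState after,
                CommitInfo info) throws CommitFailedException {
            if (after.hasChildNode("tmp")) {
                // fail: the commit is refused and nothing is persisted
                throw new CommitFailedException("Example", 1, "/tmp not allowed");
            }
            return after;   // pass unchanged; returning an edited state
                            // would modify the commit instead
        }
    }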
  • 69. Slide 32 Editors and Validators are commit hooks in disguise. They run a single content diff, calling back into the respective implementations. This makes them considerably less expensive than raw commit hooks. However, due to the callback-based programming model, they are more difficult to use. 11/17/2014 69
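A sketch of a Validator built on Oak's DefaultValidator base class; the naming rule is invented for illustration:

    import org.apache.jackrabbit.oak.api.CommitFailedException;
    import org.apache.jackrabbit.oak.api.PropertyState;
    import org.apache.jackrabbit.oak.spi.commit.DefaultValidator;

    class NoWhitespaceNames extends DefaultValidator {
        @Override
        public void propertyAdded(PropertyState after) throws CommitFailedException {
            if (after.getName().contains(" ")) {
                throw new CommitFailedException(
                        "Example", 2, "property names must not contain spaces");
            }
        }
        // The child node callbacks return a Validator for the sub tree,
        // so the single content diff only descends where needed.
    }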
  • 70. Slide 34 Observers are related to commit hooks. In fact both share the same signature, as they get a before and an after state and can run a content diff on them to process the differences. However, observers run after the fact, that is, after the content has been persisted: observers report what has happened, while commit hooks report what might happen. Observers might or might not see each separate revision. After big cluster merges they might receive bigger chunks spanning multiple individual commits, and there might be no way to get at those individual commits as they might already have been garbage collected on the other cluster node. The important guarantee is that each observer sees a monotonically increasing sequence of revisions. 11/17/2014 70
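A sketch of the Observer contract; the diffing body is left as a comment since it would reuse the content diff shown earlier:

    import org.apache.jackrabbit.oak.spi.commit.CommitInfo;
    import org.apache.jackrabbit.oak.spi.commit.Observer;
    import org.apache.jackrabbit.oak.spi.state.NodeState;

    class DiffingObserver implements Observer {
        private NodeState previous;   // last root this observer has seen

        @Override
        public synchronized void contentChanged(NodeState root, CommitInfo info) {
            if (previous != null) {
                // a content diff between previous and root goes here; root
                // may span several commits after a big cluster merge
            }
            previous = root;          // roots arrive in monotonic order
        }
    }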
  • 71. Slide 35 As JCR observation in Oak is based on observers, and those might not get to see all commits, commit boundaries are usually lost. That is, information attached to individual commits, like the user or a timestamp, is only available on a best-effort basis in Oak. External index updates are used e.g. for the Solr integration. 11/17/2014 71
  • 72. Slide 37 Although search supports the same languages as Jackrabbit (SQL and XPath), the underlying implementations differ greatly. Oak has pluggable parsers and indexes. For each query a suitable parser is first looked up and, if available, used to parse the query into an intermediate representation. Then all registered indexes are asked to provide a cost estimate for executing the query. Finally the query is executed using the cheapest index. If there is no suitable index for a query, the synthetic "traverse" index is used; this index, while very slow, allows every query to eventually succeed. Oak's approach to query execution is closer to the RDBMS world than Jackrabbit 2's, as it allows creating indexes to speed up queries. 11/17/2014 72
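A hedged end-to-end example using the standard JCR query API; the query and node types are illustrative:

    import javax.jcr.RepositoryException;
    import javax.jcr.Session;
    import javax.jcr.query.Query;
    import javax.jcr.query.QueryManager;
    import javax.jcr.query.QueryResult;

    class FindByTitle {
        static QueryResult find(Session session) throws RepositoryException {
            QueryManager qm = session.getWorkspace().getQueryManager();
            Query query = qm.createQuery(
                    "SELECT * FROM [nt:base] WHERE [title] = 'Oak'",
                    Query.JCR_SQL2);
            return query.execute();   // runs on the cheapest available index,
                                      // or on the traversal fallback
        }
    }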
  • 73. Slide 38 Indexes are specified in content via special index definition nodes. • Property index for properties of a given name; can optionally be ordered. • Reference index for looking up the referrers of referenceable nodes. • Lucene full text index for full text search; storing it in content replicates it automatically across cluster nodes and makes it easy to back up along with the content. • Solr can be run embedded but usually runs externally. 11/17/2014 73
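A sketch of declaring a property index under /oak:index via plain JCR; the node structure follows the documented convention, though exact property types may differ across Oak versions:

    import javax.jcr.Node;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;

    class DefineTitleIndex {
        static void define(Session session) throws RepositoryException {
            Node index = session.getNode("/oak:index")
                    .addNode("title", "oak:QueryIndexDefinition");
            index.setProperty("type", "property");
            index.setProperty("propertyNames", new String[] {"title"});
            index.setProperty("reindex", true);   // build the index on save
            session.save();
        }
    }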
  • 74. Slide 40 • The MicroKernel implements the tree model. • The NodeStore API exposes the tree model as immutable trees. • Oak core implements mutable trees on top of the NodeStore API. • Plugins contribute much of the Oak core functionality, like access control and validation, through commit hooks and observers. • The Oak API exposes mutable trees whose behaviour is shaped by the plugins configured. • The JCR bindings provide the right shaping for JCR by provisioning Oak with the right set of plugins. Other bindings besides JCR are possible, e.g. an HTTP binding as an alternative to the current WebDAV implementation. New applications could also use the Oak API directly and reuse only the plugins necessary for their needs. 11/17/2014 74
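Finally, a sketch of assembling a repository from these layers with Oak's Jcr builder, plugging in the illustrative hook and observer from the sketches above:

    import javax.jcr.Repository;
    import org.apache.jackrabbit.oak.jcr.Jcr;

    class RepositoryAssembly {
        static Repository create() {
            return new Jcr()                     // default NodeStore
                    .with(new RejectTmpNodes())  // a CommitHook
                    .with(new DiffingObserver()) // an Observer
                    .createRepository();
        }
    }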
