Clojure has been heralded as a pioneer in data-oriented functional programming. In this talk, Huahai will explore the use of a Clojure data diffing/patching library as a tool to simplify software architecture and solve complex engineering problems. After briefly describing EditScript, a Clojure data diffing/patching library, he will detail several usage patterns, drawing on code examples from our production system.
Huahai will discuss how diffing improves system modularization by reducing namespace dependencies; how it drastically simplifies client-server communication to drive much faster UI iterations; how it enables massive scaling by turning stateful applications into stateless ones; and how it powers collaborative editing of online documents.
This talk is for everyone who is interested in expanding their data-oriented functional programming toolbox.
2. What is diffing?
• Given two elements a and b, calculate the difference d between them
• Function (diff a b) ;=> d
• Function (patch a d)
• Such that (= b (patch a d))
• Or: (= b (patch a (diff a b)))
• These are normally true:
• (not= (diff a b) (diff b a))
• (= (diff a c) (concat (diff a b) (diff b c)))
• (< (size d) (min (size a) (size b)))
• (< (time (patch a d)) (time (diff a b)))
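The properties above can be tried out with a minimal sketch in plain Clojure. The `diff` and `patch` below are toy helpers written for this illustration and only handle flat maps; they are an assumption for demonstration, not the EditScript implementation.

```clojure
;; Toy diff/patch for flat maps (hypothetical helpers, not a library API).
;; An edit is [key op] or [key op value], with op one of :+ :- :r.

(defn diff
  "Returns a vector of edits that turns flat map a into flat map b."
  [a b]
  (vec
   (concat
    ;; keys that exist only in a are removed
    (for [k (keys a) :when (not (contains? b k))]
      [k :-])
    ;; keys that are new in b are added, changed values are replaced
    (for [[k v] b
          :when (or (not (contains? a k)) (not= v (get a k)))]
      (if (contains? a k) [k :r v] [k :+ v])))))

(defn patch
  "Applies the edits in d to flat map a."
  [a d]
  (reduce (fn [m [k op v]]
            (case op
              :-      (dissoc m k)
              (:+ :r) (assoc m k v)))
          a
          d))

(def a {:name "Ann" :age 30 :city "Oslo"})
(def b {:name "Ann" :age 31 :country "Norway"})
(def d (diff a b))

(println "d:" d)
(println "round trip holds:" (= b (patch a d)))
(println "diff is directional:" (not= (diff a b) (diff b a)))
```

Only the changed keys appear in `d`, which is why a diff is usually smaller than either input.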
3. Evolution of diffing (1)
• Earliest diff was developed by Doug McIlroy on Unix at Bell Labs in 1974
• Works on text files; the units of work are lines of text
• Purpose: reduce the storage needed to maintain multiple versions of a file
• Use: compare content, track changes, verify output, version control
4. Evolution of diffing (2)
• Diffing in 3D graphics programming
• World modeled as a scene graph
• Only re-render changed subtrees
• Purpose: performance optimization
• Conceptually simple programming model: render everything
• Inspired React.js
• A ClojureScript wrapper of React could be faster than React itself, due to faster diffing with immutable data
5. Evolution of diffing (3)
• Data oriented programming
• Data, not text
• Data are directly meaningful for code, no need for parsing or decoding
• Generic data literals, not specialized opaque programming constructs
• Diff input and output are both data
• Diffing as a software architecture consideration, not just an implementation detail, impacting
• Delineation of system components
• Data model design
• API design
6. Diffing enables decoupling
• diff & patch functions are generic and blind
• They don't have to understand their input for them to work
• Semantic asymmetry between sender and receiver enforces separation of concerns
• Also supports a kind of natural encapsulation, not forced like in OOP
• d is still open for inspection if the receiver chooses to inspect it
• Graded: the receiver doesn’t need to know a lot, but can know a lot if it chooses to
Sender: (diff a a’) ;=> d
Sender sends d to Receiver
Receiver: (patch a d) ;=> a’
7. Diffing encourages data model reuse
• Thanks to diffing, data duplication between components is faithful and cheap
• It is advantageous to reuse the same data model throughout the system, which dramatically simplifies the system
8. Diffing tracks changes
• Thanks to diffing, each version of the world state can be cheaply saved and replayed to recover the originals
• Application statefulness can be externalized and managed
9. Editscript: a Clojure data diffing library
• https://github.com/juji-io/editscript
• Works for vectors, lists, sets and maps
• Edits are a vector of vectors:
• Path
• Op :+, :-, or :r
• Value
• Diffing algorithms
• Quick: fast
• A* : optimal diff size
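To make the path/op/value edit format concrete, here is a minimal sketch in plain Clojure that interprets such edits against nested maps. `dissoc-in`, `apply-edit` and `apply-edits` are hypothetical names introduced for this illustration; the sketch deliberately ignores vectors, lists and sets, and it is not how the library itself applies a patch.

```clojure
;; Hypothetical interpreter for edits of the form [path op value],
;; restricted to nested maps. Not the library's own code.

(defn dissoc-in
  "Removes the entry at path from nested map m."
  [m path]
  (if (= 1 (count path))
    (dissoc m (first path))
    (update-in m (butlast path) dissoc (last path))))

(defn apply-edit
  "Applies a single [path op value] edit to nested map m."
  [m [path op value]]
  (case op
    (:+ :r) (assoc-in m path value)   ; add or replace the value at path
    :-      (dissoc-in m path)))      ; delete the value at path

(defn apply-edits
  "Applies a vector of edits in order."
  [m edits]
  (reduce apply-edit m edits))

(def doc
  {:title  "Intro"
   :author {:name "Ann" :email "ann@example.com"}})

(def edits
  [[[:title] :r "Introduction"]      ; replace
   [[:author :email] :-]             ; delete
   [[:author :role] :+ :editor]])    ; add

(println (apply-edits doc edits))
```

Because the edits are ordinary vectors, they can be printed, stored, sent over the wire, or inspected like any other data.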
10. Case study: Juji Studio UI Re-design
• Complete UI redesign
• Re-implementation
• One month turnaround
• Mainly due to switching from a resource-oriented API to a diffing based API
[Screenshot: UI before the redesign]
11. Case study: Juji Studio UI Re-design
[Screenshot: UI after the redesign]
12. UI Data model: config doc
• Single Page Application (SPA) in cljs
• State is kept in an EDN document: the config doc
• SPA, server and DB all have copies of the config doc
[Diagram: SPA, Server and DB each hold a copy of the config doc; the SPA reaches the server through the GraphQL API]
13. Traditional GraphQL API
• Resource-oriented (RESTful)
• Server-side config doc is the truth
• API is CRUD on server resources
• i.e. paths in the config doc
• Repetitive CRUD calls for each and every type of node
• Thousands of lines of Lacinia schema
14. Diffing based GraphQL API
• All logic is in the SPA
• API is CRUD on the config doc
• An update is sent as a diff
• SPA periodically sends to server: (diff doc-prev doc-now)
• Server applies the diff, saves the doc in DB, and replies with the config doc SHA
• SPA validates the SHA; if it differs, the SPA sends its full config doc to overwrite the server copy
• Removed all API calls on paths and nodes
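The exchange on this slide can be sketched in plain Clojure as follows. Everything here is an assumption made for illustration: the toy `diff`/`patch` only handle flat maps, an atom stands in for the server and its DB, and `sha`, `handle-diff!`, `handle-overwrite!` and `sync!` are hypothetical names rather than the production API.

```clojure
(import 'java.security.MessageDigest)

;; Toy diff/patch for flat maps (hypothetical helpers, not a library API).
(defn diff [a b]
  (vec (concat (for [k (keys a) :when (not (contains? b k))] [k :-])
               (for [[k v] b :when (not= v (get a k ::absent))] [k :+ v]))))

(defn patch [a d]
  (reduce (fn [m [k op v]] (if (= op :-) (dissoc m k) (assoc m k v))) a d))

(defn sha
  "Hex SHA-256 of a flat map. Keys are sorted first so that equal maps
  always produce the same printed form."
  [doc]
  (let [digest (.digest (MessageDigest/getInstance "SHA-256")
                        (.getBytes (pr-str (into (sorted-map) doc)) "UTF-8"))]
    (apply str (map #(format "%02x" %) digest))))

;; Server side: an atom stands in for the server's saved copy.
(def server-doc (atom {:bot "helper" :greeting "Hi"}))

(defn handle-diff!
  "Applies the diff, saves the doc, replies with its checksum."
  [d]
  (sha (swap! server-doc patch d)))

(defn handle-overwrite!
  "Replaces the server copy with the full doc sent by the SPA."
  [doc]
  (sha (reset! server-doc doc)))

;; SPA side: send the diff, fall back to the full doc on a mismatch.
(defn sync! [doc-prev doc-now]
  (let [reply (handle-diff! (diff doc-prev doc-now))]
    (if (= reply (sha doc-now))
      :in-sync
      (do (handle-overwrite! doc-now)
          :overwritten))))

(def doc-prev {:bot "helper" :greeting "Hi"})
(def doc-now  {:bot "helper" :greeting "Hello" :lang "en"})

(println (sync! doc-prev doc-now))                  ; copies agree, diff suffices
(reset! server-doc {:bot "stale"})                  ; simulate a drifted server copy
(println (sync! doc-now (assoc doc-now :lang "fr")))
```

The server never needs to know what a path or node in the document means; it only applies edits and reports a checksum.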
15. Case study: externalize application states
• How to scale a highly stateful application?
• E.g. Juji initiates an agent (rep) for each chat session on a server node; the state of each rep is stored in an atom
• What if the server node becomes unavailable?
[Diagram: API Gateway in front of a Server Node]
16. Case study: externalize application states
• Each rep sends a diff of its state to a persistent log (e.g. Kafka)
• E.g. at each utterance, the rep sends (diff state-prev state-now)
• When a server becomes unavailable, the API gateway forwards traffic to another server, which recovers the agent state from the persistent log by sequentially applying all diffs to a shared initial state
[Diagram: API Gateway, Server Node, and a Persistent Log that receives diffs]
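Recovery by replay amounts to a fold over the logged diffs. The sketch below assumes flat-map states, toy `diff`/`patch` helpers, and an in-memory vector standing in for the persistent log; `record!` and `recover` are hypothetical names, and no real Kafka client is involved.

```clojure
;; Toy diff/patch for flat maps (hypothetical helpers, not a library API).
(defn diff [a b]
  (vec (concat (for [k (keys a) :when (not (contains? b k))] [k :-])
               (for [[k v] b :when (not= v (get a k ::absent))] [k :+ v]))))

(defn patch [a d]
  (reduce (fn [m [k op v]] (if (= op :-) (dissoc m k) (assoc m k v))) a d))

;; An in-memory vector stands in for the persistent log.
(def log (atom []))

(defn record!
  "Appends the diff between two consecutive states to the log."
  [state-prev state-now]
  (swap! log conj (diff state-prev state-now))
  state-now)

(defn recover
  "Rebuilds the latest state by applying every logged diff, in order,
  to the shared initial state."
  [initial-state diffs]
  (reduce patch initial-state diffs))

;; States of one rep over a short chat, one per utterance (made-up data).
(def states
  [{:turn 0}
   {:turn 1 :topic "intro"}
   {:turn 2 :topic "pricing" :lead? true}
   {:turn 3 :lead? true}])

(doseq [[prev now] (partition 2 1 states)]
  (record! prev now))

;; A different server node only needs the initial state and the log.
(println "recovered:" (recover (first states) @log))
```

Replaying a prefix of the log yields the state as of that utterance, so the same log also serves as a history.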
17. Case study: reduce component dependency
• Stateful components depend on one another
• Introducing user-invokable system functions leads to circular dependencies, e.g. (juji.func.system/cleanup-chat rep)
[Diagram: dependencies among System, Reps, Rep, Subs and func.system, with a [:rt jujiid] label]
18. Case study: reduce component dependency (continued)
• Instead of depending on namespaces that contain subscriptions
• Watch the reps atom
• Inspect the diff between its old and new values
• Handle the case when a rep is removed or cleaned up
• i.e. send a :user-left message to channels, and let the subscriptions clean themselves up
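A minimal sketch of the watch-based approach, assuming the reps atom is a map from an id to a rep state. The `channel` atom stands in for the real channels, the message shape is made up, and a key-level set difference stands in for a full data diff.

```clojure
(require '[clojure.set :as set])

;; Assumed shape: id -> rep state.
(def reps (atom {}))

;; Stand-in for the channels that subscriptions listen on.
(def channel (atom []))

;; The watcher compares old and new values of the atom and reacts only
;; to reps that disappeared. It needs no reference to the namespaces
;; that own the subscriptions.
(add-watch reps ::rep-removed
           (fn [_key _ref old-reps new-reps]
             (doseq [id (set/difference (set (keys old-reps))
                                        (set (keys new-reps)))]
               (swap! channel conj {:type :user-left :jujiid id}))))

(swap! reps assoc "rep-1" {:turn 0} "rep-2" {:turn 0})
(swap! reps dissoc "rep-1")                ; rep-1 is removed
(swap! reps update "rep-2" assoc :turn 1)  ; an ordinary update, no message

(println @channel)
```

The dependency now points one way only: the watcher depends on the atom, and the subscriptions depend on the message.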
19. Case study: synchronize collaborative editing
• Multiple parties sending diffs
• Out of sync when lines cross paths
• Difficult yet common problem
• E.g. enable multiple users to edit the same chat at the same time
• Locking has bad UX
• Three-way merge has high latency
[Diagram: two parties both start from A; one sends (diff A A’), the other sends (diff A A’’), and the two diffs cross]
20. Differential Synchronization
• Diffing based synchronization method
• Scalable
• Fault-tolerant
• Low latency
• Developed by Neil Fraser in 2009
• Used by Google Docs
24. Data modeling guideline: Don’t use vector
• Minimize unnecessary use of ordered data structures, e.g. vectors or lists
• Diffing algorithms are slow for ordered data, because order is a strong constraint to satisfy
• Ordered O(mn) vs. Unordered O(m+n)
• The implicit order of data elements is often a source of incidental complexity
• Meaningful order is often based on data fields
• Sets or maps suffice in most cases
Bad: [ {} {} {} … ]
Good: { {} {} {} … } or #{ {} {} {} … }
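One way to follow this guideline is to key the elements by an id and keep the order in an explicit field. The data and the `index-by` and `ordered` helpers below are made up for illustration.

```clojure
;; Made-up data: a list of questions originally stored as a vector.
(def questions-vec
  [{:id :q1 :pos 1 :text "Name?"}
   {:id :q2 :pos 2 :text "Email?"}
   {:id :q3 :pos 3 :text "Role?"}])

(defn index-by
  "Turns a collection of maps into a map keyed by (f element)."
  [f coll]
  (into {} (map (juxt f identity)) coll))

;; The same data as a map: order now lives in the :pos field.
(def questions (index-by :id questions-vec))

(defn ordered
  "Recovers a display order from the data itself when one is needed."
  [m]
  (sort-by :pos (vals m)))

;; Removing the first element of the vector shifts every later index.
;; Removing a key from the map touches that one entry only.
(def without-q1 (dissoc questions :q1))

(println (map :id (ordered questions)))
(println (map :id (ordered without-q1)))
```

With the map form, a change to one element shows up as a change under one key, which keeps diffs small and cheap to compute.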
25. Conclusion
• Diffing offers a few properties that lead to
• Simplified software architecture
• Enhanced system decoupling
• Easier scaling of stateful applications
• Better solutions to data synchronization problems
• Worthwhile to consider diffing based software architecture
• Particularly for data-oriented programming