This talk will explore how we REBOOTED our Project Design. After a decade of production usage, the RavenDB team addressed a lot of ongoing concerns & changed some of RavenDB's core architecture.
We'll investigate the driving forces behind it and the reasoning process, and look at how it all turned out.
2. JOEL SPOLSKY
“…the single worst strategic
mistake that any software
company can make: They decided
to rewrite the code from scratch.”
3. RAVENDB 1.0
• Design & Development started around 2009
• First production deployment – 2010
• ESENT for storage
• Lucene for indexing
• MVC .NET architecture
• Focused on time to market
4. RAVENDB 2.0
• Released 2013
• Silverlight UI
• New storage engine for testing only – Munin
• Lots of tiny features, no architectural changes
• Focus on features
• Perf
5. RAVENDB 3.0
• Released 2014
• HTML5 Studio
• Introduced own storage engine for production use – Voron
• Re-architected to work on OWIN
• I/O prefetching
• Own thread pool
• Focus on distributed work
• Perf
6. JAN 2015
• Two sprints dedicated to performance tuning
• Get a profiler, whole team effort
• Started to pay attention to assembly output
• Major performance gains
7. SUPPORT
• We are a database…
• Complex deployments
• Harsh environment
• 24/7
• High availability
• Lots of data, lots of requests
• High sensitivity to hiccups
8. Q3 2015 – WHAT CAN WE DO?
• Backward compatibility
• Deployed to (literally, in the literal sense) millions of nodes worldwide at various customer sites
• Existing technical debt
• Can’t run on Linux
• Running on tech stack that is not fully
owned by us
• Existing complexity
11. EFFECT?
• We have a decade+ of knowing how our software is used
• Hindsight
• Where does it hurt?
• UX study
• Deep dive with customers
• Going over support incident reports
12. ARCHITECTURAL DECISIONS
• Multiple OS
• OWN the stack
• Build for performance
• Build for operations
• Support as a key consideration
• Key scenarios impact whole-system design
13. WE ARE A JSON DOCUMENT DB…
• JSON parsing
• Parsing means:
• Text parsing
• Allocating managed memory
• Reading data from disk
14. GOING ALL THE WAY DOWN…
• JSON documents are stored using a blittable format
• No parsing / manipulation required to process
• .NET representation:
• byte* ptr, int len
• Storage engine: Voron
• Zero copy
• Memory mapped
• Can search for a given value and return the result:
• byte* ptr, int len
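The "no parsing required" idea above can be sketched with a hypothetical length-prefixed layout (this is an assumed illustration, not RavenDB's actual blittable format): a reader walks the raw bytes and hands back the equivalent of `byte* ptr, int len` into the original buffer, with no text parsing and no copies.

```python
import struct

# Hypothetical sketch of a blittable-style layout (NOT RavenDB's real
# format): each field is stored as
#   [2-byte name length][name bytes][4-byte value length][value bytes]
# so a value can be located and returned as (pointer, length) into the
# original buffer -- no text parsing, no managed allocations per field.

def write_doc(fields):
    buf = bytearray()
    for name, value in fields.items():
        n, v = name.encode(), value.encode()
        buf += struct.pack("<H", len(n)) + n
        buf += struct.pack("<I", len(v)) + v
    return bytes(buf)

def find_value(buf, wanted):
    """Return a zero-copy memoryview over the value bytes (ptr + len)."""
    view, pos = memoryview(buf), 0
    target = wanted.encode()
    while pos < len(buf):
        (nlen,) = struct.unpack_from("<H", buf, pos); pos += 2
        name = view[pos:pos + nlen]; pos += nlen
        (vlen,) = struct.unpack_from("<I", buf, pos); pos += 4
        if name == target:
            return view[pos:pos + vlen]   # window into the buffer, no copy
        pos += vlen
    return None

doc = write_doc({"name": "RavenDB", "lang": "C#"})
val = find_value(doc, "lang")
```

The point is that lookup never materializes a string or object graph; the caller gets a view into the bytes it already had.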
16. RESULT?
• Zero copy throughout the process
• No parsing costs
• Therefore, reduced CPU
• Therefore, no need to cache in memory
• Therefore, reduced managed memory
• Memory mapped
• Therefore, the OS already keeps it in memory
• Avoid duplicate data caching
• Reduce memory consumption
• Lean on OS for eviction
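The memory-mapped point above can be shown with a minimal sketch (an assumed illustration, not RavenDB code): the file's pages live in the OS page cache, so the application reads through the mapping instead of maintaining a duplicate cache of its own.

```python
import mmap
import os
import tempfile

# Sketch of leaning on the OS page cache via a memory-mapped file:
# no read() into an application buffer, no second copy of the data.

path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"A" * 4096 + b"payload" + b"B" * 100)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm)           # zero-copy window over the mapping
    chunk = bytes(view[4096:4103])  # pages are faulted in by the OS on access
    view.release()
    mm.close()
```

Eviction is the OS's problem: under memory pressure it drops clean pages, which is exactly the "lean on OS for eviction" bullet above.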
17. MEMORY MANAGEMENT
• Need to own that, important for perf!
• Reduce managed allocations
• Use unmanaged memory and take advantage of internal knowledge
• Simple solutions can work well in a well-defined context
• Arena Allocator
• Context
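An arena allocator is exactly such a simple solution in a well-defined context. Here is a toy bump-pointer sketch (a hypothetical illustration, not RavenDB's allocator): allocation is advancing an offset into one pre-reserved buffer, and releasing everything tied to a context is a single reset.

```python
# Toy arena (bump-pointer) allocator: one up-front reservation,
# allocations are just offset arithmetic, and "free everything for
# this request/context" is an O(1) reset.

class Arena:
    def __init__(self, size):
        self.buf = bytearray(size)  # single up-front reservation
        self.offset = 0

    def alloc(self, n):
        if self.offset + n > len(self.buf):
            raise MemoryError("arena exhausted")
        view = memoryview(self.buf)[self.offset:self.offset + n]
        self.offset += n
        return view

    def reset(self):
        # End of context: release every allocation at once.
        self.offset = 0

arena = Arena(1024)
a = arena.alloc(16)
b = arena.alloc(32)
arena.reset()  # request is done -- no per-object frees, no GC pressure
```

This only works because of the well-defined context: allocations share a lifetime (e.g. one request), so individual frees are never needed.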
18. DESIGN FOR THE DEBUGGER
• Support burden optimization
• Single threaded execution for long running tasks
• Named threads
• Build data structures for analysis in dumps
• Singular architecture focus
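The "named threads" bullet can be illustrated with a small sketch (an assumed example, not RavenDB code): when every long-running task names its thread, a debugger or memory dump shows "Indexing" instead of an anonymous "Thread-7", which directly cuts support-call time.

```python
import threading

# Give each long-running, single-threaded task a descriptive thread
# name so dumps and debuggers identify it immediately.

def indexing_loop(results):
    # In a dump, this thread shows up under its descriptive name.
    results.append(threading.current_thread().name)

results = []
worker = threading.Thread(target=indexing_loop, args=(results,),
                          name="Indexing")
worker.start()
worker.join()
```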
20. REMOVING PROBLEMS
• Some things are known to be issues.
• Eliminate completely if possible.
• Authentication sample:
• Windows Authentication
• Who’s the admin?
• X509 Client Certificates
• openssl s_client -connect
• Fallacies of distributed computing
21. WHY NOT REWRITE IT IN RUST?
• Rust
• Go
• C++
• C
• Erlang
• Elixir
• Big bet on .NET Core
• Team familiar with .NET
• Even with added complications, design, tooling and language support are very good
• Cross platform now.
• Core values aligned with CoreCLR in terms of perf / support.
On the table…
22. HOW WE ACTUALLY DID IT?
• Indexes
  • Queries
  • Indexing
  • Query Optimizer
• Storage
• Cluster
  • Replication
  • ETL
• Documents
  • Operations
  • Load
• Client API
  • C#, Java, Python, Go, Ruby
23. PARALLEL VERSION DEVELOPMENT
• 3.5 released while working on 4.0
• Identify bottlenecks in the design
• As infrastructure completes, parallelize
• Slowly transition team members to the new release
• Prioritizing demo-ability of the system
• 30% of the team dedicated to the UI
24. WHERE ARE WE NOW?
• Initially budgeted for 1 year of development
• Team size: 25
• Started Sep 2015
• OMG, we have so many features!
• Mid 2016: changed schedule to June 2017
• Released 3.5 while working on 4.0
• Ramp-up time internally
• Supporting older versions simultaneously
• Actual release – Feb 2018
• Some features cut
• Actual completion of all planned features: Aug 2018 (3 years!)
• With a lot of extras, of course
25. IMPACT ON SUPPORT?
• Support call time reduced
• Typical 3.x tier 2 support call: 1 week
• Typical 4.x tier 2 support call: 2 hours
28. END RESULT?
• However:
• 14 months overdue
• Deadline extended twice
• Much larger in scope than expected
• Even when taking this into account
• Features in 3.0 are only in the 4.1 release, expected next month
• 20 months over expected time
• Mostly minor, though
• Full support by the whole team and company was essential