This talk will explore how we REBOOTED our Project Design. After a decade of production usage, the RavenDB team addressed a lot of ongoing concerns & changed some of RavenDB's core architecture.
We'll investigate the driving forces behind it and the reasoning process, and look at how it all turned out.
2. JOEL SPOLSKY
“…the single worst strategic
mistake that any software
company can make: They decided
to rewrite the code from scratch.”
3. RAVENDB 1.0
• Design & Development started around 2009
• First production deployment – 2010
• ESENT for storage
• Lucene for indexing
• MVC .NET architecture
• Focused on time to market
4. RAVENDB 2.0
• Released 2013
• Silverlight UI
• New storage engine for testing only – Munin
• Lots of tiny features, no architectural changes
• Focus on features
• Perf
5. RAVENDB 3.0
• Released 2014
• HTML5 Studio
• Introduced own storage engine for production use – Voron
• Re-architected to work on OWIN
• I/O prefetching
• Own thread pool
• Focus on distributed work
• Perf
6. JAN 2015
• Two sprints dedicated to performance tuning
• Get a profiler, whole team effort
• Started to pay attention to assembly output
• Major performance gains
7. SUPPORT
• We are a database…
• Complex deployments
• Harsh environment
• 24/7
• High availability
• Lots of data, lots of requests
• High sensitivity to hiccups
8. Q3 2015 – WHAT CAN WE DO?
• Backward compatibility
• Deployed to (literally, in the literal sense) millions of nodes worldwide at various customer sites
• Existing technical debt
• Can’t run on Linux
• Running on tech stack that is not fully
owned by us
• Existing complexity
11. EFFECT?
• We have a decade+ of knowing how our software is used
• Hindsight
• Where does it hurt?
• UX study
• Deep dive with customers
• Going over support incident reports
12. ARCHITECTURAL DECISIONS
• Multiple OS
• OWN the stack
• Build for performance
• Build for operations
• Support as a key consideration
• Key scenarios impact whole-system design
13. WE ARE A JSON DOCUMENT DB…
• JSON parsing
• Parsing means:
• Text parsing
• Allocating managed memory
• Reading data from disk
14. GOING ALL THE WAY DOWN…
• JSON documents are stored using a blittable format
• No parsing / manipulation required to process
• .NET representation:
• byte* ptr, int len
• Storage engine: Voron
• Zero copy
• Memory mapped
• Can search for a given value and return the result:
• byte* ptr, int len
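The "no parsing required" idea above can be sketched with a hypothetical length-prefixed layout (this is an assumed illustration, not RavenDB's actual blittable format): a reader walks the raw bytes and hands back the equivalent of `byte* ptr, int len` into the original buffer, with no text parsing and no copies.

```python
import struct

# Hypothetical sketch of a blittable-style layout (NOT RavenDB's real
# format): each field is stored as
#   [2-byte name length][name bytes][4-byte value length][value bytes]
# so a value can be located and returned as (pointer, length) into the
# original buffer -- no text parsing, no managed allocations per field.

def write_doc(fields):
    buf = bytearray()
    for name, value in fields.items():
        n, v = name.encode(), value.encode()
        buf += struct.pack("<H", len(n)) + n
        buf += struct.pack("<I", len(v)) + v
    return bytes(buf)

def find_value(buf, wanted):
    """Return a zero-copy memoryview over the value bytes (ptr + len)."""
    view, pos = memoryview(buf), 0
    target = wanted.encode()
    while pos < len(buf):
        (nlen,) = struct.unpack_from("<H", buf, pos); pos += 2
        name = view[pos:pos + nlen]; pos += nlen
        (vlen,) = struct.unpack_from("<I", buf, pos); pos += 4
        if name == target:
            return view[pos:pos + vlen]   # window into the buffer, no copy
        pos += vlen
    return None

doc = write_doc({"name": "RavenDB", "lang": "C#"})
val = find_value(doc, "lang")
```

The point is that lookup never materializes a string or object graph; the caller gets a view into the bytes it already had.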
16. RESULT?
• Zero copy throughout the process
• No parsing costs
• Therefore, reduced CPU
• Therefore, no need to cache in memory
• Therefore, reduced managed memory
• Memory mapped
• Therefore, the OS already keeps it in memory
• Avoid duplicate data caching
• Reduce memory consumption
• Lean on OS for eviction
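The memory-mapped point above can be shown with a minimal sketch (an assumed illustration, not RavenDB code): the file's pages live in the OS page cache, so the application reads through the mapping instead of maintaining a duplicate cache of its own.

```python
import mmap
import os
import tempfile

# Sketch of leaning on the OS page cache via a memory-mapped file:
# no read() into an application buffer, no second copy of the data.

path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"A" * 4096 + b"payload" + b"B" * 100)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm)           # zero-copy window over the mapping
    chunk = bytes(view[4096:4103])  # pages are faulted in by the OS on access
    view.release()
    mm.close()
```

Eviction is the OS's problem: under memory pressure it drops clean pages, which is exactly the "lean on OS for eviction" bullet above.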
17. MEMORY MANAGEMENT
• Need to own that, important for perf!
• Reduce managed allocations
• Use unmanaged memory and take advantage of internal knowledge
• Simple solutions can work well in a well-defined context
• Arena Allocator
• Context
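An arena allocator is exactly such a simple solution in a well-defined context. Here is a toy bump-pointer sketch (a hypothetical illustration, not RavenDB's allocator): allocation is advancing an offset into one pre-reserved buffer, and releasing everything tied to a context is a single reset.

```python
# Toy arena (bump-pointer) allocator: one up-front reservation,
# allocations are just offset arithmetic, and "free everything for
# this request/context" is an O(1) reset.

class Arena:
    def __init__(self, size):
        self.buf = bytearray(size)  # single up-front reservation
        self.offset = 0

    def alloc(self, n):
        if self.offset + n > len(self.buf):
            raise MemoryError("arena exhausted")
        view = memoryview(self.buf)[self.offset:self.offset + n]
        self.offset += n
        return view

    def reset(self):
        # End of context: release every allocation at once.
        self.offset = 0

arena = Arena(1024)
a = arena.alloc(16)
b = arena.alloc(32)
arena.reset()  # request is done -- no per-object frees, no GC pressure
```

This only works because of the well-defined context: allocations share a lifetime (e.g. one request), so individual frees are never needed.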
18. DESIGN FOR THE DEBUGGER
• Support burden optimization
• Single threaded execution for long running tasks
• Named threads
• Build data structures for analysis in dumps
• Singular architecture focus
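The "named threads" bullet can be illustrated with a small sketch (an assumed example, not RavenDB code): when every long-running task names its thread, a debugger or memory dump shows "Indexing" instead of an anonymous "Thread-7", which directly cuts support-call time.

```python
import threading

# Give each long-running, single-threaded task a descriptive thread
# name so dumps and debuggers identify it immediately.

def indexing_loop(results):
    # In a dump, this thread shows up under its descriptive name.
    results.append(threading.current_thread().name)

results = []
worker = threading.Thread(target=indexing_loop, args=(results,),
                          name="Indexing")
worker.start()
worker.join()
```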
20. REMOVING PROBLEMS
• Some things are known to be issues.
• Eliminate completely if possible.
• Authentication sample:
• Windows Authentication
• Who’s the admin?
• X509 Client Certificates
• openssl s_client -connect
• Fallacies of distributed computing
21. WHY NOT REWRITE IT IN RUST?
• Rust
• Go
• C++
• C
• Erlang
• Elixir
• Big bet on .NET Core
• Team familiar with .NET
• Even with added complications, design, tooling and language support are very good
• Cross platform now.
• Core values aligned with CoreCLR in terms of perf / support.
On the table…
22. HOW WE ACTUALLY DID IT?
• Indexes
  • Queries
  • Indexing
  • Query Optimizer
• Storage
• Cluster
  • Replication
  • ETL
• Documents
  • Operations
  • Load
• Client API
  • C#, Java, Python, Go, Ruby
23. PARALLEL VERSION DEVELOPMENT
• 3.5 released while working on 4.0
• Identify bottlenecks in the design
• As infrastructure completes, parallelize
• Slowly transition team members to the new release
• Prioritizing demo-ability of the system
• 30% of the team dedicated to the UI
24. WHERE ARE WE NOW?
• Initially budgeted for 1 year of development
• Team size: 25
• Started Sep 2015
• OMG, we have so many features!
• Mid 2016: changed schedule to June 2017
• Released 3.5 while working on 4.0
• Ramp-up time internally
• Supporting older versions simultaneously
• Actual release – Feb 2018
• Some features cut
• Actual completion of all planned features: Aug 2018 (3 years!)
• With a lot of extras, of course
25. IMPACT ON SUPPORT?
• Support call time reduced
• Typical 3.x tier 2 support call: 1 week
• Typical 4.x tier 2 support call: 2 hours
28. END RESULT?
• However:
• 14 months overdue
• Deadline extended twice
• Much larger in scope than expected
• Even when taking this into account
• Features in 3.0 are only in the 4.1 release, expected next month
• 20 months over expected time
• Mostly minor, though
• Full support by the whole team and company was essential