The document summarizes the evolution of the Guardian's content management system from early relational databases and bespoke solutions in the 1990s to modern approaches using NoSQL document databases. It describes periods relying on vendor CMS platforms, monolithic Java applications with Oracle databases, and attempts to scale with partial NoSQL solutions like Solr and memcached. The document advocates a full transition to document stores like MongoDB by migrating systems like user identity management and designing flexible JSON data models instead of relational schemas and ORM complexity.
6. Mid Period: 2000s (The “Vendor CMS era”)
Vignette / AOLserver
TCL, Apache, Oracle
Platform for online
publishing
Initially scales well with
acceleration in delivery
of features
7. Mid Period: 2000s (The “Vendor CMS era”)
Surprise! Vendor’s CMS
doesn’t do what we want!
Mish-mash in templates:
HTML, JavaScript, TCL,
SQL, PL-SQL
No model in app tier, only
in RDBMS schema created
in Oracle Designer
10. Mid Period: 2000s (The “Vendor CMS era”)
After a few years, very
difficult to extend
Database schema
becomes fixed due to
dependencies in
templates
11. Mid Period: 2000s (The “Vendor CMS era”)
If you can’t change the
system:
12. Modern Period
circa ’05-09
The “J2EE Monolithic” era
13.
14. Web server Web server Web server
I bring you NEWS!!!
App server App server App server
Oracle
CMS Data feeds
15. Web server Web server Web server
Modern java app
I bring you NEWS!!!
App server App server App server
Spring / Hibernate
DDD / TDD
Strong Oracle in java
model
Database abstracted away with ORM
CMS Data feeds
18. Complexity still increasing:
300+ tables,
10,000 lines of hibernate XML config
1,000 domain objects mapped to database
70,000 lines of domain object code
Very tight binding to database
19. ORM not really masking complexity:
Database has strong influence on domain model: many
domain objects made more complex mapping joins in
RDBMS
Complex hibernate features used, interceptors, proxies
Complex caching strategy
20. ORM not really masking complexity:
Database has strong influence on domain model: many
domain objects made more complex mapping joins in
RDBMS
Complex hibernate features used, interceptors, proxies
Complex caching strategy
And:
We still hand code most queries in SQL!
22. Partial NoSQL
circa ’09-10
The “Sticking Plaster” era
23. Introduce yet more caching to patch up load problems
Text
Introduction of memcached
24. Decouple applications from database by building APIs
Power APIs using alternative, more scalable technologies
APIs used to scale out database reads
Writes still go to RDBMs
25. Mutualised news!
Content API
http://content.guardianapis.com
Read only access to 10 years worth of content
26. Core
Api
Web servers
Solr/API
App servers
Solr/API
Memcached (20Gb)
Solr/API
oracle Solr
Solr/API
Solr/API
CMS Cloud, EC2
32. Mutualised news! understandable
JSON API is simple and
Multiple domain concepts expressed in single document
Can be designed in forwardly extensible way
33. Mutualised news! understandable
JSON API is simple and
Multiple domain concepts expressed in single document
Can be designed in forwardly extensible way
What if the JSON API was our primary model?
34. Full NoSQL
in development
The “It’s the future!” era
35. MongoDB
Mutualised news! database
Document oriented
Stores parsed JSON documents
Can express complex queries
Can be flexible about consistency
Malleable schema: can easily change at runtime
Can work at both large & small scales
38. Flexible Schema
Mutualised news!
Can easily represent different classes of tag as
documents
Both documents can be inserted into
same collection
Far simpler than equivalent hibernate
mapped subclass configuration
44. The first project: Identity
Current login/registration system still in TCL/PL-SQL
3M+ users in relational database
Very complex schema + PL-SQL
New system required
Can we migrate from Oracle to NoSql?
45. Build API that can support both backends
Registration app guardian.co.uk
API This bit is hard!
Oracle
46. Build API that can support both backends
Registration app guardian.co.uk
API
MongoDB Oracle
47. Migrate using API & decommision
Registration app guardian.co.uk
API
MongoDB
48. Add new stuff!
Registration app guardian.co.uk
API
MongoDB Solr? Redis?
49. Summary: Evolving to Document Store
Relational databases are dead
Ruthlessly reject relational complexity
Think hard about your json model
Plan your evolution
http://content.guardianapis.com
graham.tackley@guardian.co.uk - @tackers
Notes de l'éditeur
g.co.uk: leading liberal news website40m uniques p/m, half from US\n
publishing newspaper for nearly 200 years; website for 15 yrs\nadapting to change is critical - start with history of fighting relational dbs & attempts to tame; then how we’re moving to mongo\n
\n
Ancient system\nScripts & database\nBespoke software, changes difficult\n
\n
Site oriented to broadcast publishing model\nCMS helps. No longer lashing things together \n\n
Template & rdbms oriented design, and TCL = no real domain model\nHeavyweight schema change process\n\n
This is from a TEMPLATE!\nscroll down to reveal HTML\n(about 10,000 of these)\n
bottom of template\nabout 10,000 of these!\n\n
Can’t change schema easily, to many dependencies in templates\n\n
dodo\ne.g. at start just articles; now video, interactives, audio, galleries, live blogs...\n
Very standard 3 tier application\nScale application servers on load\nCaching local to application server at first. Memcached added later\nRead heavy, broadcast model. Almost no writes compared to reads\n\n
Very standard 3 tier application\nScale application servers on load\nCaching local to application server at first. Memcached added later (in next era!)\nRead heavy, broadcast model. Almost no writes compared to reads\n\n
\n
\n
\n
\n
\n
Talk: beginning to use NoSql in real organisation. Change in journalism affecting platform\n\n
My team spent 2 yrs making relational db inconsitant:\nmuch better to deliever 30 sec old data in 200ms, than up to data data in 5 secs\nWe don’t have a scale problem with current application & model\n\n\n
Talk: beginning to use NoSql in real organisation. Change in journalism affecting platform\n\n
Content api - both for external people (engaging with the internet) and for us. \nMost new stuff from content api, e.g. m.guardian, iPhone, iPad etc.\n
Introduction of memcached & Solr\nSolr hosted in the cloud (EC2)\n
“Out” service\n
\n
Most recent content\n\n
Most recent content with tags, fields\n(this is pretty well how we went live with the content api)\n\n\n
Single article with media\nExtensible schema, eg: adding geotagging to images. Hard in DB, easy in JSON\nThis document represents at least 30 database tables!\n\n\n
\n\n
\n
Not a million miles from a RDBMS\nSimpler\n
Experiments with mongodb & content API\nGuardian site categorises content with tags\nTone tag represents “editorial tone” of content\n\n
Different tag types can have different schemas\nKeywords (subjects) are in a section, music / madonna\n\n
\n\n
\n\n
\n\n
Suppose we want to add external musicbrainz ID to tag?\nAn update can modify the schema at runtime. No downtime.\n\n
Where clause: id\n$push atomically ads external reference onto tag\n\n
Resulting document now looks like this\nWe haven’t had to change the code yet - old code still works, no downtime\n\n
Migration project, not green fields\n(SKIP IF LESS THAN 10 MINS TO GO!)\n\n
REST API\nMapped initially just to oracle, then (next slide) to both datastores\nIntegration tested\n\n
API supports both data stores - lazy migration\nCurrently writing this in Scala + Casbah - so far 60-70% less code for mongo version\n\n
Then batch migration and bye bye oracle\n
In the future?\n\n
1. we will never use a relational database for a new project\n2. kill: orm; memcached etc\n3. spend the time thinking about how your problem space should be defined\n4. things will change - plan for it!\n