No More SQL
A chronicle of moving a data repository from a
traditional relational database to MongoDB

Glenn Street
Database Architect, Copyright Clearance Center
Who am I?
●

Database Architect at Copyright Clearance Center

●

Oracle Certified Professional

●

Many years of database development and
administration

●

Learning to embrace “polyglot persistence”

●

Been working with MongoDB since version 1.6
What is Copyright Clearance Center?
"Copyright Clearance Center (CCC), the rights licensing expert, is a
global rights broker for the world’s most sought-after books, journals,
blogs, movies and more.
Founded in 1978 as a not-for-profit organization, CCC provides smart
solutions that simplify the access and licensing of content. These
solutions let businesses and academic institutions quickly get
permission to share copyright-protected materials, while
compensating publishers and creators for the use of their works."
www.copyright.com
What I want to talk about today
●

Not application design, but data management issues

●

Our experience in moving from the "legacy" relational
way of doing things

●

These experiences come from one large project
What do I mean by “data
management”?
●

Topics like naming conventions, data element
definitions

●

Data modeling

●

Data integration

●

Talking to legacy (relational) databases

●

Archive, purge, retention, backups
Where we started
●

200+ tables in a relational database

●

Core set of tables fewer, but many supporting tables

●

2.5 TB total (including TEMP space, etc.)

●

Many PL/SQL packages and procedures

●

Solr for search
Today
●

We use MongoDB in several products

●

The one I'll talk about today is our largest MongoDB
database (> 2 TB)

●

Live in production end of September
What options did we have in the past
for horizontal scaling?
●

At the database layer, few

●

Clustering ($$)

●

So, we emphasized scaling at the application tier

●

We wanted to be able to scale out the database tier in
a low-cost way
What kind of data?
●

"Work" data, primarily books, articles, journals

●

Associated metadata
–

Publisher, author, etc.
Application characteristics
●

Most queries are reads via Solr index

●

Database access is needed for additional metadata not
stored in Solr

●

Custom matching algorithms for data loads

●

Database updates are done in bulk (loading)

●

Loads of data come from third-party providers

●

On top of this we've built many reports, canned and
ad-hoc
Here's what the core data model
looked like: highly normalized
Where we are today
●

MongoDB database: 12 shards x 200 GB (2.4 TB)

●

Replica sets, including hidden members for backup
(more about that later)

●

GridFS for data to be loaded

●

MMS for monitoring

●

JEE application (no stored procedure code)

●

Solr for search
What motivated us?
●

Downtime every time we made even the simplest
database schema update

●

The data model was not appropriate for our use case
–

Bulk loading (very poor performance)

–

Read-mostly (few updates)

–

We want to be able to see most of a "work's" metadata at
once

–

This led to many joins, given our normalized data model
More motivators
●

Every data loader required custom coding

●

The business users wanted more control over adding
data to the data model “on-the-fly” (e.g., a new data
provider with added metadata)

●

This would be nearly impossible using a relational
database

●

MongoDB's flexible schema model is perfect for this
use!
What were our constraints?
●

Originally, we wanted to revamp the nature of how
we represent a work

●

Our idea was to construct a work made up of
varying data sources, a “canonical” work

●

But, as so often happens, time the avenger was not
on our side
We needed to reverse-engineer
functionality
●

This meant we needed to translate the relational
structures

●

We probably didn't take full advantage of a
document-oriented database

●

The entire team was more familiar with the relational
model

●

Lesson:
–

Help your entire team get into the polyglot persistence
mindset
We came up with a single JSON
document
●

We weighed the usual issues:
–

Embedding vs. linking

●

Several books touch on this topic, as does the
MongoDB manual
–

One excellent one: MongoDB Applied Design Patterns
by Rick Copeland, O'Reilly Media.
We favored embedding
●

"Child" tables became "child" documents

●

This seemed the most natural translation of
relational to document

●

But, this led to larger documents

●

Lesson:
–

We could have used linking more
Example: one-to-one relationship
In MongoDB
work...
"publicationCountry" :
{
    "country_code" : "CHE",
    "country_description" : "Switzerland"
}
Example: one-to-many relationship
In MongoDB
An array of “work contributors”
"work_contributor" : [
    {
        "contributorName" : "Ballauri, Jorgji S.",
        "contributorRoleDescr" : "Author"
    },
    {
        "contributorName" : "Maxwell, William",
        "contributorRoleDescr" : "Editor"
    },...
]
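For contrast with the embedded array above, the linking alternative might look like the following minimal Python sketch. It uses plain dictionaries rather than a MongoDB driver; the `contributors` lookup, the `_id` values, and the `contributor_id` field are hypothetical names chosen for illustration, not taken from the actual data model.

```python
# Hypothetical "contributors" collection, keyed by _id (illustrative values only).
contributors = {
    1: {"_id": 1, "contributorName": "Ballauri, Jorgji S."},
    2: {"_id": 2, "contributorName": "Maxwell, William"},
}

# Linked form: the work stores a reference plus the role, not the full child document.
linked_work = {
    "_id": 100,
    "work_contributor": [
        {"contributor_id": 1, "contributorRoleDescr": "Author"},
        {"contributor_id": 2, "contributorRoleDescr": "Editor"},
    ],
}


def resolve_contributors(work, contributor_lookup):
    """Rebuild the embedded shape by following each reference (an application-side join)."""
    resolved = []
    for link in work["work_contributor"]:
        person = contributor_lookup[link["contributor_id"]]
        resolved.append({
            "contributorName": person["contributorName"],
            "contributorRoleDescr": link["contributorRoleDescr"],
        })
    return resolved


embedded_view = resolve_contributors(linked_work, contributors)
```

The work document stays smaller this way, at the cost of an extra lookup whenever the full contributor details are needed.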
When embedding...
●

Consider the resulting size of your documents

●

Embedding is akin to denormalization in the
relational world

●

Denormalization is not always the answer (even for
RDBMS)!
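One rough way to watch document size while embedding is sketched below in Python. It approximates size by the compact JSON encoding, which is only a proxy for the real BSON size, and compares it with MongoDB's 16 MB per-document limit. The document shape and the helper names are assumptions for illustration.

```python
import json

MAX_BSON_DOCUMENT_BYTES = 16 * 1024 * 1024  # MongoDB's per-document size limit


def approx_size_bytes(document):
    """Approximate a document's size by its compact UTF-8 JSON encoding.

    This is only a proxy: the real BSON encoding differs somewhat.
    """
    return len(json.dumps(document, separators=(",", ":")).encode("utf-8"))


def work_with_contributors(count):
    """Build a hypothetical work document with `count` embedded contributors."""
    return {
        "title": "Example work",
        "work_contributor": [
            {"contributorName": f"Contributor {i}", "contributorRoleDescr": "Author"}
            for i in range(count)
        ],
    }


# Size grows with every embedded child document.
sizes = {n: approx_size_bytes(work_with_contributors(n)) for n in (1, 100, 10_000)}
for n, size in sizes.items():
    print(n, size, f"{size / MAX_BSON_DOCUMENT_BYTES:.4%} of the limit")
```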
Data migration from our relational
database
●

Wrote a custom series of ETL processes

●

Combined Talend Data Integration and custom-built
code

●

Also leveraged our new loader program
But...we still had to talk to a relational
database
●

The legacy relational database became a reporting and
batch-process database (at least for now)

●

Data from our new MongoDB system of record needed to be
synced with the relational database
–

Wrote a custom process to transform the JSON structure back to
relational tables

●

Lesson:
–

Consider relational constraints when syncing from MongoDB to a
relational database
●

We had to account for some discrepancies in field lengths (MongoDB is more
flexible)
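The kind of check this lesson implies can be sketched in Python as follows: flatten the embedded array into one row per child, then flag values that would not fit the relational columns. The table layout, column names, and widths are hypothetical; the actual sync process is not described in these slides.

```python
# Hypothetical column widths for a relational WORK_CONTRIBUTOR table.
COLUMN_MAX_LENGTHS = {"CONTRIBUTOR_NAME": 20, "ROLE_DESCR": 10}


def flatten_contributors(work):
    """Turn the embedded contributor array into one row per contributor."""
    return [
        {
            "WORK_ID": work["_id"],
            "CONTRIBUTOR_NAME": c["contributorName"],
            "ROLE_DESCR": c["contributorRoleDescr"],
        }
        for c in work.get("work_contributor", [])
    ]


def length_violations(row):
    """Return the names of columns whose string values exceed the column width."""
    return [
        column
        for column, limit in COLUMN_MAX_LENGTHS.items()
        if len(row[column]) > limit
    ]


work = {
    "_id": 100,
    "work_contributor": [
        {"contributorName": "Maxwell, William", "contributorRoleDescr": "Editor"},
        {"contributorName": "A contributor name longer than twenty characters",
         "contributorRoleDescr": "Author"},
    ],
}

rows = flatten_contributors(work)
# Map row index -> offending columns, keeping only rows that have a problem.
problems = {}
for index, row in enumerate(rows):
    violations = length_violations(row)
    if violations:
        problems[index] = violations
```

Rows that land in `problems` would need truncation, a wider column, or rejection before being written to the relational side.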
More Lessons Learned
●

Document size is key!

●

The data management practices you're used to from the relational
world must be adapted; example: key names

●

In the relational world, we favor longer names

●

We found that large key names were causing us pain
–

We're not the first: see “On shortened field names in MongoDB” blog post

–

But, this goes against “good” relational database naming practices (e.g.,
longer column names are self-documenting)
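Because every document stores its own key names, long names are repeated in each document. The Python sketch below estimates the saving from shortening them; the short names in `SHORT_KEYS` are made up for illustration, and the JSON-encoded length is only a proxy for BSON size.

```python
import json

# Hypothetical mapping from descriptive key names to shortened ones.
SHORT_KEYS = {
    "work_contributor": "wc",
    "contributorName": "n",
    "contributorRoleDescr": "r",
}


def shorten_keys(value):
    """Recursively rename keys using SHORT_KEYS, leaving unknown keys untouched."""
    if isinstance(value, dict):
        return {SHORT_KEYS.get(k, k): shorten_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [shorten_keys(item) for item in value]
    return value


def approx_size_bytes(document):
    """Approximate size via compact UTF-8 JSON; real BSON sizes differ somewhat."""
    return len(json.dumps(document, separators=(",", ":")).encode("utf-8"))


long_doc = {
    "work_contributor": [
        {"contributorName": "Ballauri, Jorgji S.", "contributorRoleDescr": "Author"},
        {"contributorName": "Maxwell, William", "contributorRoleDescr": "Editor"},
    ]
}

short_doc = shorten_keys(long_doc)
saved = approx_size_bytes(long_doc) - approx_size_bytes(short_doc)
print(f"approximate bytes saved in this one document: {saved}")
```

The trade-off is the one noted above: the shortened document is smaller but no longer self-documenting, so the mapping has to be maintained somewhere.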
More Lessons Learned
●

Our way of using Spring Data introduced its own
problems
–

“scaffolding”

●

Nesting of keys for flexibility was painful
Example:
workItemValues.work_createdUser.rawValue
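To illustrate why such nesting gets tedious, here is a small Python sketch of the lookup every consumer ends up needing for a path like the one above. The `get_path` helper and the sample value `"loader"` are hypothetical.

```python
def get_path(document, dotted_path, default=None):
    """Walk a nested dict along a dot-separated path; return default if a step is missing."""
    current = document
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Three levels of nesting just to reach one scalar value.
work = {"workItemValues": {"work_createdUser": {"rawValue": "loader"}}}

created_by = get_path(work, "workItemValues.work_createdUser.rawValue")
missing = get_path(work, "workItemValues.work_updatedUser.rawValue", default="n/a")
```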
Backups at this scale are challenging!
●

Mongodump and mongoexport were too slow for
our needs

●

Decided on hidden replica set members on AWS

●

Using filesystem snapshots for backups

●

Looking into MMS Backup service
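As a rough illustration of the hidden-member approach, the Python sketch below builds a replica set configuration document of the shape MongoDB expects and checks one rule: a hidden member must have priority 0 so it can never become primary. The set name and host names are placeholders, and the snapshot step itself is not shown.

```python
# Hypothetical replica set configuration document; names and hosts are placeholders.
replica_set_config = {
    "_id": "shard01",
    "members": [
        {"_id": 0, "host": "db1.example.net:27017"},
        {"_id": 1, "host": "db2.example.net:27017"},
        # Hidden member reserved for backups: never primary, not visible to clients.
        {"_id": 2, "host": "backup1.example.net:27017", "priority": 0, "hidden": True},
    ],
}


def invalid_hidden_members(config):
    """Return hosts of hidden members whose priority is not 0."""
    return [
        member["host"]
        for member in config["members"]
        if member.get("hidden") and member.get("priority", 1) != 0
    ]


def backup_candidates(config):
    """Return hosts of hidden members, i.e. the ones a snapshot job could target."""
    return [m["host"] for m in config["members"] if m.get("hidden")]


bad_members = invalid_hidden_members(replica_set_config)
snapshot_hosts = backup_candidates(replica_set_config)
```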
Another Lesson: Non-/Semi-Technical Users
●

For example, business analysts, product owners

●

Many know and like SQL

●

Many don't understand a document-oriented database

●

Engineering spent a lot of time and effort in raising
the comfort level
–

This was not universally successful

●

An interesting project, SQL4NoSQL
How to communicate structure?
Communicating Structure
●

Mind map was helpful initially

●

Difficult to maintain
JSON Schema
{
    "$schema": "http://json-schema.org/draft-03/schema",
    "title": "Phase I Schema",
    "description": "Describes the structure of the MongoDB database for Phase I",
    "type": "object",
    "id": "http://jsonschema.net",
    "required": false,
    "properties": {
        "_id": {
            "type": "string",
            "required": false
        },
        ...
JSON Schema for communicating
structure
●

I created a JSON schema representation of the
“work” document
–

JSON Schema

–

JSON Schema.net

●

Was used by QA and other teams for supporting
tools

●

JSON Schema also useful, but also cumbersome to
maintain
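The sort of supporting tool a QA team could build on such a schema might resemble the Python sketch below. It is a hand-rolled check that covers only the `type` keyword and the draft-03 style per-property `required` flag, not a full JSON Schema validator. The `title` property is a hypothetical addition for the example.

```python
# Cut-down schema in the draft-03 style shown earlier ("required" is a per-property boolean).
SCHEMA = {
    "type": "object",
    "properties": {
        "_id": {"type": "string", "required": False},
        "title": {"type": "string", "required": True},  # hypothetical property
    },
}

# Only the handful of JSON types this sketch understands.
JSON_TYPES = {"string": str, "object": dict, "array": list, "boolean": bool}


def check(document, schema):
    """Return a list of problems; an empty list means the document passed."""
    problems = []
    for name, rules in schema.get("properties", {}).items():
        if name not in document:
            if rules.get("required", False):
                problems.append(f"missing required property: {name}")
            continue
        expected = JSON_TYPES.get(rules.get("type"))
        if expected is not None and not isinstance(document[name], expected):
            problems.append(f"{name} should be of type {rules['type']}")
    return problems


ok = check({"_id": "abc", "title": "Example work"}, SCHEMA)
bad = check({"_id": 5}, SCHEMA)
```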
Next Steps/Challenges
●

Investigating on-disk (file system) compression
–

Very promising so far

●

Can we be more "document-oriented"?
–

Remove vestiges of relational data models

●

Implement an archiving and purging strategy

●

Investigating MMS Backup
Vote for these JIRA Items!
●

“Option to store data compressed”

●

“Bulk insert is slow in sharded environment”

●

“Tokenize the field names”

●

“Increase max document size to at least 64mb”

●

“Collection level locking”
Thanks!
●

Twitter: @GlennRStreet

●

Blog: http://glennstreet.net/

●

LinkedIn: http://www.linkedin.com/in/glennrstreet/

No More SQL

  • 1. No More SQL A chronicle of moving a data repository from a traditional relational database to MongoDB Glenn Street Database Architect, Copyright Clearance Center
  • 2. Who am I? ● Database Architect at Copyright Clearance Center ● Oracle Certified Professional ● Many years of database development and administration ● Learning to embrace “polyglot persistence” ● Been working with MongoDB since version 1.6
  • 3. What is Copyright Clearance Center? "Copyright Clearance Center (CCC), the rights licensing expert, is a global rights broker for the world’s most sought-after books, journals, blogs, movies and more. Founded in 1978 as a not-for-profit organization, CCC provides smart solutions that simplify the access and licensing of content. These solutions let businesses and academic institutions quickly get permission to share copyright-protected materials, while compensating publishers and creators for the use of their works." www.copyright.com
  • 4. What I want to talk about today ● ● ● Not application design, but data management issues Our experience in moving from "legacy" relational data way of doing things These experiences come from one large project
  • 5. What do I mean by “data management”? ● Topics like naming conventions, data element definitions ● Data modeling ● Data integration ● Talking to legacy (relational) databases ● Archive, purge, retention, backups
  • 6. Where we started ● 200+ tables in a relational database ● Core set of tables fewer, but many supporting tables ● 2.5 TB total (including TEMP space, etc.) ● Many PL/SQL packages and procedures ● Solr for search
  • 7. Today ● ● ● We use MongoDB in several products The one I'll talk about today is our largest MongoDB database (> 2 TB) Live in production end of September
  • 8. What options did we have in the past for horizontal scaling? ● At the database layer, few ● Clustering ($$) ● So, we emphasized scaling at the application tier ● We wanted to be able to scale out the database tier in a low-cost way
  • 9. What kind of data? ● "Work" data, primarily books, articles, journals ● Associated metadata – Publisher, author, etc.
  • 10. Application characteristics ● ● Most queries are reads via Solr index Database access is needed for additional metadata not stored in Solr ● Custom matching algorithms for data loads ● Database updates are done in-bulk (loading) ● Loads of data come from third-party providers ● On top of this we've built many reports, canned and ad-hoc
  • 11. Here's what the core data model looked like: highly normalized
  • 12. Where we are today ● ● 12 MongoDB shards x 200 GB (2.4 TB) MongoDB database Replica sets, including hidden members for backup (more about that later) ● GridFS for data to be loaded ● MMS for monitoring ● JEE application (no stored procedure code) ● Solr for search
  • 13. What motivated us? ● ● Downtime every time we made even the simplest database schema update The data model was not appropriate for our use case – Bulk loading (very poor performance) – Read-mostly (few updates) – We want to be able to see most of a "work's" metadata at once – This lead to many joins, given our normalized data model
  • 14. More motivators ● ● ● ● Every data loader required custom coding The business users wanted more control over adding data to the data model “on-the-fly” (e.g., a new data provider with added metadata) This would be nearly impossible using a relational database MongoDB's flexible schema model is perfect for this use!
  • 15. What were our constraints? ● ● ● Originally, we wanted to revamp the nature of how we represent a work Our idea was to construct a work made up of varying data sources, a “canonical” work But, as so often happens, time the avenger was not on our side
  • 16. We needed to reverse-engineer functionality ● ● ● ● This meant we needed to translate the relational structures We probably didn't take full advantage of a documentoriented database The entire team was more familiar with the relational model Lesson: – Help your entire team get into the polyglot persistence mindset
  • 17. We came up with a single JSON document ● We weighed the usual issues: – ● Embedding vs. linking Several books touch on this topic, as does the MongoDB manual – One excellent one: MongoDB Applied Design Patterns by Rick Copeland, O'Reilly Media.
  • 18. We favored embedding ● "Child" tables became "child" documents ● This seemed the most natural translation of relational to document ● But, this led to larger documents ● Lesson: – We could have used linking more
  • 20. In MongoDB work... "publicationCountry" : { "country_code" : "CHE", "country_description" : "Switzerland" }
  • 22. In MongoDB An array of “work contributors” "work_contributor" : [ { "contributorName" : "Ballauri, Jorgji S.", "contributorRoleDescr" : "Author" }, { "contributorName" : "Maxwell, William", "contributorRoleDescr" : "Editor" },... ]
  • 23. When embedding... ● Consider the resulting size of your documents ● Embedding is akin to denormalization in the relational world ● Denormalization is not always the answer (even for RDBMS)!
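The embedding-versus-linking trade-off from slides 17 to 23 can be illustrated with plain Python dictionaries. This is only a sketch: the `_id` values, the `contributor_ids` key, and the separate contributor store are hypothetical and not taken from the talk; the contributor data comes from slide 22. JSON length is used as a rough stand-in for document size.

```python
import json

# Embedded form: contributors live inside the work document (as on slide 22).
embedded_work = {
    "_id": "work-1",  # hypothetical identifier
    "work_contributor": [
        {"contributorName": "Ballauri, Jorgji S.", "contributorRoleDescr": "Author"},
        {"contributorName": "Maxwell, William", "contributorRoleDescr": "Editor"},
    ],
}

# Linked form: contributors are kept separately and the work holds references.
# The store layout and the "contributor_ids" key are assumptions for illustration.
contributors = {
    "c-1": {"contributorName": "Ballauri, Jorgji S.", "contributorRoleDescr": "Author"},
    "c-2": {"contributorName": "Maxwell, William", "contributorRoleDescr": "Editor"},
}
linked_work = {"_id": "work-1", "contributor_ids": ["c-1", "c-2"]}


def resolve_contributors(work, contributor_store):
    """Follow the links: this is the extra lookup that embedding avoids."""
    return [contributor_store[cid] for cid in work["contributor_ids"]]


def json_size(doc):
    """Rough size of a document, measured as UTF-8 encoded JSON."""
    return len(json.dumps(doc).encode("utf-8"))
```

The linked work document is smaller, but reading the full work costs a second lookup per reference; the embedded form returns everything in one read at the price of a larger document.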
  • 24. Data migration from our relational database ● Wrote a custom series of ETL processes ● Combined Talend Data Integration and custom-built code ● Also leveraged our new loader program
  • 25. But...we still had to talk to a relational database ● The legacy relational database became a reporting and batch-process database (at least for now) ● Data from our new MongoDB system of record needed to be synced with the relational database – Wrote a custom process to transform the JSON structure back to relational tables ● Lesson: – Consider relational constraints when syncing from MongoDB to a relational database ● We had to account for some discrepancies in field lengths (MongoDB is more flexible)
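The sync described on slide 25 is not shown in the talk. As a minimal sketch of the idea, assuming a hypothetical child table with `work_id`, `position`, `contributor_name`, and `role_descr` columns and made-up column length limits, an embedded array can be flattened into rows and each row checked against the relational field lengths before it is written:

```python
# Hypothetical column length limits for the relational target table.
COLUMN_LIMITS = {"contributor_name": 20, "role_descr": 10}


def contributor_rows(work):
    """Flatten the embedded contributor array into one row per contributor."""
    rows = []
    for position, contributor in enumerate(work.get("work_contributor", [])):
        rows.append({
            "work_id": work["_id"],  # parent key, as a foreign key would be
            "position": position,    # preserves the array order
            "contributor_name": contributor.get("contributorName"),
            "role_descr": contributor.get("contributorRoleDescr"),
        })
    return rows


def length_violations(row, limits=COLUMN_LIMITS):
    """Return the columns whose values would not fit the relational column."""
    return [
        column
        for column, max_length in limits.items()
        if row.get(column) is not None and len(row[column]) > max_length
    ]
```

Rows with violations can then be rejected, truncated, or logged, depending on what the reporting database needs; which of these the team chose is not stated in the slides.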
  • 26. More Lessons Learned ● Document size is key! ● The data management practices you're used to from the relational world must be adapted; example: key names ● In the relational world, we favor longer names ● We found that large key names were causing us pain – We're not the first: see “On shortened field names in MongoDB” blog post – But, this goes against “good” relational database naming practices (e.g., longer column names are self-documenting)
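Key names matter for size because every stored document carries its own field names. One common workaround, sketched here under assumptions, is to map long names to short ones at the persistence boundary and back again on read. The short names in `SHORT_KEYS` are invented for illustration, and JSON length is only a rough proxy for the stored BSON size.

```python
import json

# Hypothetical long-to-short key mapping; the short names are made up.
SHORT_KEYS = {
    "work_contributor": "wc",
    "contributorName": "cn",
    "contributorRoleDescr": "cr",
}
# Reverse mapping used when reading documents back.
LONG_KEYS = {short: long for long, short in SHORT_KEYS.items()}


def rename_keys(value, mapping):
    """Recursively rename dictionary keys; unmapped keys are left unchanged."""
    if isinstance(value, dict):
        return {mapping.get(k, k): rename_keys(v, mapping) for k, v in value.items()}
    if isinstance(value, list):
        return [rename_keys(item, mapping) for item in value]
    return value


document = {
    "work_contributor": [
        {"contributorName": "Maxwell, William", "contributorRoleDescr": "Editor"}
    ]
}
stored = rename_keys(document, SHORT_KEYS)   # what would be written
restored = rename_keys(stored, LONG_KEYS)    # what the application sees
saved_bytes = len(json.dumps(document)) - len(json.dumps(stored))
```

The cost is the one the slide names: the stored documents are no longer self-documenting, so the mapping itself has to be maintained and shared.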
  • 27. More Lessons Learned ● Our way of using Spring Data introduced its own problems – “scaffolding” ● Nesting of keys for flexibility was painful – Example: workItemValues.work_createdUser.rawValue
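To show why deep nesting is awkward for consumers, here is a small sketch of a helper that walks a dotted path such as the one on slide 27. The helper name `get_path` and the sample value are hypothetical; only the path itself comes from the slide.

```python
def get_path(document, dotted_path, default=None):
    """Walk a nested dictionary along a dotted path, returning default if any step is missing."""
    current = document
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Sample document shaped like the path on slide 27; the value is made up.
work = {"workItemValues": {"work_createdUser": {"rawValue": "example-user"}}}
created_by = get_path(work, "workItemValues.work_createdUser.rawValue")
```

Every reader of the data needs this kind of defensive traversal at each level, which is part of what makes the nested layout painful.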
  • 28. Backups at this scale are challenging! ● Mongodump and mongoexport were too slow for our needs ● Decided on hidden replica set members on AWS ● Using filesystem snapshots for backups ● Looking into MMS Backup service
  • 29. Another Lesson: Non/Semi-Technical Users ● For example, business analysts, product owners ● Many know and like SQL ● Many don't understand a document-oriented database ● Engineering spent a lot of time and effort in raising the comfort level – This was not universally successful ● An interesting project: SQL4NoSQL
  • 30. How to communicate structure?
  • 31. Communicating Structure ● Mind map was helpful initially ● Difficult to maintain
  • 32. JSON Schema {"$schema": "http://json-schema.org/draft-03/schema", "title": "Phase I Schema", "description": "Describes the structure of the MongoDB database for Phase I", "type":"object", "id": "http://jsonschema.net", "required":false, "properties":{ "_id": { "type":"string", "required":false }, ...
  • 33. JSON Schema for communicating structure ● I created a JSON Schema representation of the “work” document – JSON Schema – JSON Schema.net ● Was used by QA and other teams for supporting tools ● JSON Schema also useful, but also cumbersome to maintain
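As an illustration of the kind of supporting tool a schema enables, the sketch below checks a document against a tiny draft-03 style schema, where `required` is a boolean on each property as in slide 32. It is a hand-rolled checker covering only `type`, `properties`, and `required`, not a full JSON Schema validator, and the `work_contributor` property in the sample schema is an assumption added for the example.

```python
# Maps the JSON Schema type names handled by this sketch to Python types.
TYPE_MAP = {"string": str, "object": dict, "array": list, "boolean": bool}


def check(document, schema, path="$"):
    """Return a list of problems; an empty list means the document passed."""
    expected = TYPE_MAP.get(schema.get("type"))
    if expected is not None and not isinstance(document, expected):
        return [f"{path}: expected {schema['type']}"]
    problems = []
    if isinstance(document, dict):
        for name, subschema in schema.get("properties", {}).items():
            if name in document:
                problems.extend(check(document[name], subschema, f"{path}.{name}"))
            elif subschema.get("required", False):
                problems.append(f"{path}.{name}: required property missing")
    return problems


work_schema = {
    "type": "object",
    "properties": {
        "_id": {"type": "string", "required": False},
        # Hypothetical property, not part of the schema excerpt on slide 32.
        "work_contributor": {"type": "array", "required": True},
    },
}
```

A QA script could run such a check over sampled documents and report the returned problems; a real tool would more likely use an existing validator library.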
  • 34. Next Steps/Challenges ● Investigating on-disk (file system) compression – Very promising so far ● Can we be more "document-oriented"? – Remove vestiges of relational data models ● Implement an archiving and purging strategy ● Investigating MMS Backup
  • 35. Vote for these JIRA Items! ● “Option to store data compressed” ● “Bulk insert is slow in sharded environment” ● “Tokenize the field names” ● “Increase max document size to at least 64mb” ● “Collection level locking”