The document describes MongoDB's directory layout and data structures. Key points:
- MongoDB stores each database in its own set of preallocated files containing extents.
- Data is memory mapped from files into virtual memory for fast access.
- The journal logs all operations before writing to data files for durability.
- Collections are divided into fixed-size extents containing records with BSON documents.
- Indexes are also stored in extents containing index records that point to documents.
2. Directory Layout
• Separate files per database
• Aggressive preallocation
• Files contain one or more extents
-rw------- 1 ben ben 64M May 1 19:14 test.0
-rw------- 1 ben ben 128M May 1 19:14 test.1
-rw------- 1 ben ben 256M May 1 18:25 test.2
-rw------- 1 ben ben 512M May 1 19:14 test.3
-rw------- 1 ben ben 1.0G May 1 19:14 test.4
-rw------- 1 ben ben 2.0G May 1 18:58 test.5
-rw------- 1 ben ben 16M May 1 19:14 test.ns
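The listing above suggests that each new data file is twice the size of the previous one, starting at 64 MB. A minimal sketch of that pattern follows. The 2 GB cap is an assumption based on the largest file shown, and `data_file_size` is a hypothetical helper, not a MongoDB function.

```python
MB = 1024 * 1024

def data_file_size(file_no: int, first: int = 64 * MB, cap: int = 2048 * MB) -> int:
    """Size in bytes of <dbname>.<file_no>: doubles from `first`, capped at `cap` (assumed)."""
    return min(first << file_no, cap)

# Sizes in MB for test.0 through test.5, matching the listing.
sizes = [data_file_size(n) // MB for n in range(6)]
print(sizes)  # [64, 128, 256, 512, 1024, 2048]
```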
4. Data Structures
• DiskLoc
• Stores file number and offset of data on disk
• Record *r = mmap base address + DiskLoc.offset
• Max offset is 2^31 bytes (2GB)
• NamespaceDetails
• Stores collection metadata
• Extent
• Stores contiguous blocks within a namespace
• Max extent size is 2GB
• Record
• Holds a BSON document or B-tree bucket
• DeletedRecord overwrites a Record
• Includes Padding
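The DiskLoc idea (a file number plus an offset, resolved against a memory-mapped file) can be sketched as follows. This is an illustration only: `DiskLoc` here is a plain Python tuple, `read_at` is a hypothetical helper, and anonymous memory maps stand in for the real data files.

```python
import mmap
from typing import List, NamedTuple

class DiskLoc(NamedTuple):
    file_no: int  # which data file (test.0, test.1, ...)
    offset: int   # byte offset inside that file

MAX_OFFSET = 2 ** 31  # limit quoted on the slide

def read_at(views: List[mmap.mmap], loc: DiskLoc, length: int) -> bytes:
    """Return `length` bytes starting at the mapped base of the file plus loc.offset."""
    if not 0 <= loc.offset < MAX_OFFSET:
        raise ValueError("offset out of range")
    view = views[loc.file_no]
    return view[loc.offset:loc.offset + length]

# Two small anonymous maps stand in for test.0 and test.1.
views = [mmap.mmap(-1, 4096), mmap.mmap(-1, 4096)]
views[1][128:133] = b"hello"
loc = DiskLoc(file_no=1, offset=128)
data = read_at(views, loc, 5)
```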
5. Namespace Details
• Holds metadata about a collection or index
• Stored in 1KB buckets in <dbname>.ns file
• .ns file has a fixed size, 16MB by default
• Maintains document count
• Contains heads of linked lists
[Diagram: NamespaceDetails with fields firstExtent, lastExtent, _indexes[], stats, freeList[]]
9. Extents and Records
[Diagram: an Extent with fields length, xNext, xPrev, firstRecord, lastRecord and a Data region. The Data region holds a Record with fields length, rNext, rPrev and a Document, e.g. { _id: “foo”, ... }]
11. Extents and Records
[Diagram: the same Extent, now with two Records in its Data region. Each Record has its own length, rNext, rPrev and Document, e.g. { _id: “foo”, ... }]
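The linked structure in these diagrams can be sketched as two nested doubly linked lists: extents chained by xNext/xPrev, and records inside each extent chained by rNext/rPrev. The sketch below uses Python object references where the on-disk format would store locations, and `append` and `scan` are hypothetical helpers added for illustration.

```python
from dataclasses import dataclass
from typing import Iterator, Optional

@dataclass
class Record:
    document: dict
    rNext: Optional["Record"] = None
    rPrev: Optional["Record"] = None

@dataclass
class Extent:
    firstRecord: Optional[Record] = None
    lastRecord: Optional[Record] = None
    xNext: Optional["Extent"] = None
    xPrev: Optional["Extent"] = None

    def append(self, document: dict) -> Record:
        """Link a new record at the tail of this extent's record list."""
        rec = Record(document, rPrev=self.lastRecord)
        if self.lastRecord is None:
            self.firstRecord = rec
        else:
            self.lastRecord.rNext = rec
        self.lastRecord = rec
        return rec

def scan(first_extent: Optional[Extent]) -> Iterator[dict]:
    """Walk every extent via xNext and every record in it via rNext."""
    extent = first_extent
    while extent is not None:
        rec = extent.firstRecord
        while rec is not None:
            yield rec.document
            rec = rec.rNext
        extent = extent.xNext

e1, e2 = Extent(), Extent()
e1.xNext, e2.xPrev = e2, e1
e1.append({"_id": "foo"})
e1.append({"_id": "bar"})
e2.append({"_id": "baz"})
ids = [doc["_id"] for doc in scan(e1)]
```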
12. BSON Format
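{ hello: “world” }
\x16\x00\x00\x00 \x02 hello\x00 \x06\x00\x00\x00 world\x00 \x00
• Doc Length: \x16\x00\x00\x00 (22 bytes, little-endian)
• Type: \x02 (string)
• Key: hello\x00
• Value Length: \x06\x00\x00\x00 (6 bytes, including the trailing \x00)
• Value: world\x00
• Final \x00 terminates the document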
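The byte layout on this slide can be reproduced by hand. The sketch below encodes a document with a single string field, following the layout shown (document length, type byte, key, value length, value, terminator). `encode_string_doc` is a hypothetical helper for illustration, not part of any BSON library, and it handles only this one case.

```python
import struct

def encode_string_doc(key: str, value: str) -> bytes:
    """Encode {key: value} with one UTF-8 string field, BSON-style."""
    k = key.encode("utf-8") + b"\x00"        # key as a NUL-terminated string
    v = value.encode("utf-8") + b"\x00"      # value, NUL-terminated
    # Type byte 0x02 (string), key, int32 value length (includes the NUL), value.
    element = b"\x02" + k + struct.pack("<i", len(v)) + v
    total = 4 + len(element) + 1             # length prefix + element + terminator
    return struct.pack("<i", total) + element + b"\x00"

encoded = encode_string_doc("hello", "world")
print(encoded)
```

For `hello`/`world` the result is 22 bytes long, which is the `\x16` at the start of the slide's byte string.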
13. Index Extents
[Diagram: an Extent (length, xNext, xPrev, firstRecord, lastRecord) holding Index Records. Each Index Record (length, rNext, rPrev) contains a Bucket with parent, numKeys and keys (K); a key points to a Document]
14. Index Extents
[Diagram: the same index extent layout, shown next to a B-tree whose nodes are labeled 4, 9, 1, 3, 5, 6, 8, A, B]
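A lookup through B-tree buckets like those in the diagram can be sketched as below. The tree shape (root 4, 9 with three leaf buckets) is a guess at what the slide depicts, A and B are read as hexadecimal 10 and 11, and `Bucket`, `locs` and `find` are hypothetical names. The strings in `locs` stand in for the document locations that index keys point to.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Bucket:
    keys: List[int]                  # sorted keys; numKeys would be len(keys)
    locs: List[str]                  # stand-in for each key's document location
    children: List["Bucket"] = field(default_factory=list)

def find(bucket: Optional[Bucket], key: int) -> Optional[str]:
    """Descend from the root bucket until the key is found or a leaf is exhausted."""
    while bucket is not None:
        i = bisect_left(bucket.keys, key)
        if i < len(bucket.keys) and bucket.keys[i] == key:
            return bucket.locs[i]
        # Child i covers the keys between keys[i-1] and keys[i].
        bucket = bucket.children[i] if bucket.children else None
    return None

leaves = [
    Bucket([1, 3], ["doc1", "doc3"]),
    Bucket([5, 6, 8], ["doc5", "doc6", "doc8"]),
    Bucket([0xA, 0xB], ["docA", "docB"]),
]
root = Bucket([4, 9], ["doc4", "doc9"], leaves)
hit = find(root, 6)
miss = find(root, 7)
```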
15. Journaling
• Write ahead logging
• Operations written to journal before memory-mapped regions
• Private view
• Shared view
• Once the journal is written, data is safe unless there is a hardware problem
• By default, the journal is flushed every 100ms, after 100MB of writes, or on a write concern of j=true
• User configurable with --journalCommitInterval
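The three flush triggers listed above can be written as a single predicate. This is a sketch of the rule as stated on the slide, not of the server's code: `should_flush` and its parameter names are hypothetical, and reading "100mb" as 100 × 1024 × 1024 bytes is an assumption.

```python
MB = 1024 * 1024

def should_flush(ms_since_flush: int,
                 pending_bytes: int,
                 j: bool = False,
                 commit_interval_ms: int = 100,
                 max_pending_bytes: int = 100 * MB) -> bool:
    """True when any trigger fires: interval elapsed, too much pending data, or j=true."""
    return (j
            or ms_since_flush >= commit_interval_ms
            or pending_bytes >= max_pending_bytes)
```

A shorter `commit_interval_ms` corresponds to lowering --journalCommitInterval: writes reach the journal sooner at the cost of more frequent flushes.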
16. Journal Format
• Section contains single group commit
• Applied all-or-nothing
• Op_DbContext: set database context for subsequent operations
• Write Operation: length, offset, fileNo, data[length]
[Diagram: journal file layout: JHeader; JSectHeader [LSN 3], DurOp, DurOp, DurOp, JSectFooter; JSectHeader [LSN 7], DurOp, DurOp, DurOp, JSectFooter; …]
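The all-or-nothing behaviour of journal sections can be sketched as a replay loop that applies a section's write operations only if the section is complete. The `has_footer` flag is a stand-in for whatever completeness check the real journal performs, the Op_DbContext step is omitted (operations carry a file number directly), and `WriteOp`, `Section` and `replay` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class WriteOp:
    file_no: int
    offset: int
    data: bytes   # length is len(data)

@dataclass
class Section:
    lsn: int
    ops: List[WriteOp]
    has_footer: bool  # False if the section was cut short, e.g. by a crash

def replay(files: Dict[int, bytearray], sections: List[Section]) -> int:
    """Apply complete sections in order; stop at the first incomplete one.

    Returns the LSN of the last section applied (0 if none)."""
    last_lsn = 0
    for section in sections:
        if not section.has_footer:
            break  # an incomplete group commit is discarded as a whole
        for op in section.ops:
            files[op.file_no][op.offset:op.offset + len(op.data)] = op.data
        last_lsn = section.lsn
    return last_lsn

files = {0: bytearray(16)}
sections = [
    Section(3, [WriteOp(0, 0, b"abc"), WriteOp(0, 8, b"xyz")], has_footer=True),
    Section(7, [WriteOp(0, 0, b"ZZZ")], has_footer=False),
]
applied = replay(files, sections)
```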
17. Journal Performance
• On 99.9% read systems, no impact
• Write performance degrades 5-30% when the journal is on the same drive as the data
• With the journal on a separate drive, as low as 3%
18. Journal Admin
• Journal stored in /dbpath/journal folder
• Three 1GB files may be preallocated if that is faster
• Can symlink to a different spindle
• --journalCommitInterval (2ms - 300ms)
• When to journal
• Single node: required for data integrity
• Replica set: at least 1 node
• All nodes: removes possible need to resync
19. Fragmentation
• Files may become fragmented over time if documents change size
• Free lists also contribute to fragmentation
• 2.0 reduced scanning to reasonable amounts
• 2.2 will change allocation strategy
• Need to re-write free list to do online compaction
20. Compaction
• 1.8 and previous: repairDatabase
• 2.0+ : compact command
• Currently resets paddingFactor, but this can be changed
• Index (re)generation is now concurrent, so compaction can be N times faster
• Generally causes some extra allocation
• Does not delete or truncate files
21. Planned Changes
• Split data and indexes into different files
• Indexes could be symlinked to a different drive (SSD)
• Improved allocation strategy