[Www.pkbulk.blogspot.com]dbms12

Disk Storage Management
1
Fall 2001 Database Systems 1
Indexing
• Indexing is a combination of methods for speeding up the
access to the data in a database
• The speed is determined by two factors
– where the data is stored on disk
– available access paths
• The primary access method to any table is a table scan,
reading a table by finding all tuples one by one.
• Indexing creates multiple access paths to the data, each of
which is called a secondary access method.
– indexing speeds up access to a table for a specific set of
attributes
Indexing
• Example: items(itemid, description, price, city)
– to find all items in “Dallas”, read all tuples from secondary
storage and check if city=“Dallas” for each one (scan)
• may read many extra tuples that are not part of the answer
– suppose instead we create index itemcity for the city attribute of
the items relation
itemcity, Dallas:{t1,t5,t10}, Boston:{t2,t3,t15}, …
– to find all items for “Dallas”, we can now find the index entry for
“Dallas” and get the ids of just the tuples we want
• many fewer tuples are read from secondary storage

2
Disk terminology
rotation
Platters
(2 platters = 4 read/write surfaces
Read/write heads
one for each surface
Disk arm
controller
Track
Can read four tracks from four
surfaces at the same time (all
tracks at the same radius are
called a cylinder)
Disk terminology
• Reading data from a disk involves:
– seek time -- move heads to the correct track/cylinder
– rotational latency -- wait until the required data spins under the
read/write heads (average is about half the rotation time)
– transfer time -- transfer the required data (read/write data to/from
a buffer)
• Tracks contain the same amount of information even though they
have different circumferences
track
Block/page
A block or a page is the
smallest unit of data
transfer for a disk.
Read a block / write a
block

3
Disk Space
• Disk is much cheaper than memory (1GB: $15 - $20)
• A fast disk today:
– 40 - 10 GB per disk
– 4.9 ms average read seek time
– 2.99 ms average latency
– 10,000 rpm
– 280 - 452 Mbits/sec transfer rate
– 7.04 bits per square inch density
– 12 - 3 heads, 6 - 2 disk platters
• Disk is non-volatile storage that
survives power failures
Reading from disk
• Reading data from disk is extremely slow (compared to reading from
memory)
• To read a page from disk
– find the page on disk (seek+latency times)
– transfer the data to memory/buffer (total # bytes * transfer rate)
• Assume the average page size is 4KB. To retrieve a single row/tuple, we
need to load the page that contains it
– assume 300 Mbits/sec transfer rate, to read a page:
4KB=32764bits=0.03Mbits hence we take 1/10000 of a second
– 4.9 ms (seek) + 2.99 ms (latency) + 0.1 ms (transfer time) = 7.99 ms
(seek and latency times dominate!)

4
Reading from disk
• Assume the database saves a number of memory slots (each
holding exactly one page), which are called buffers
• To read / modify / write tuple t
– DISK: (read it from disk, write it to buffer)
– DB: (read it from buffer, modify)
– DISK: (write it to disk, free the buffer space)
Buffer slots
This buffer can hold
4 pages at any time
Tablespaces
• Age old wisdom: if you store a set of pages in contiguous pages /
blocks on disk, then the transfer time will improve greatly (reduce
seek and latency times)
• A tablespace is an allocation of space in secondary storage
– when creating a tablespace, a DBMS requests contiguous
blocks of disk space from the OS
– the tablespace appears as a single file to the OS
– the management of physical addresses in a tablespace is
performed by the DBMS
– a DBMS can have many tablespaces
– when a table is created, it is placed in a tablespace, or
partitioned between multiple tablespaces

5
Tablespaces
CREATE TABLESPACE tspace1
DATAFILE ‘diska:file1.dat’ SIZE 20M,
DATAFILE ‘diska:file2.dat’ SIZE 40M REUSE;
CREATE TABLE temp1 (
…
TABLESPACE file1
STORAGE (initial 6144, next 6144,
minextents 1, maxextents 5)
) ;
CREATE TABLE temp2 (
…
TABLESPACE file2
STORAGE (initial 12144, next 6144,
minextents 1, maxextents 5)
) ;
tspace1
file1 file2
temp1 temp2
Actual data
Tablespaces
• Create table -- assign the tuples in the table to a file in a tablespace
– when a table is created, a chunk of space is assigned to this
table, the size of this chunk is given by the “INITIALEXTENT”
– when the initial extent becomes full, a new chunk is allocated,
the size of all next chunks is given by the “NEXTEXTENT”
– can also specify
• maxextents, minextents
• pctincrease (increase the size of extents at each step)
• pctfree (how much of the extent must be left free

6
Data Storage on pages
• Layout of a single disk page (assume fixed size rows)
• To find a specific row in a page, must know
– page number (or block number) BBBBBBBB
– offset (slot number of record within the page) SSSS
– file name (which datafile/tablespace) FFFF
– ROWID is then a unique number
BBBBBBBB.SSSS.FFFF for a row
• B,S,F are hexadecimal numbers
Header
info row directory
1 2 N...
Free space
Data rows
Row N Row N-1 Row 1...
Pseudocolumns
• Since each tuple has a unique rowid, we can refer to the
tuples with their rowid field
• However, rowid may
change if the tuple is
stored at a different
location (the valueof
its primary key is a
better identifier)

7
Indexing Concepts
• Indexing speeds up access to data residing on disk
– disk access is much slower than main memory access,
by orders of magnitude
– goal – minimize the number of disk accesses
• Primary access methods rely on the physical location of
data as stored in a relation
– finding “all” tuples with value “x” requires reading the
entire relation
• Secondary access methods use a directory to enable
tuples to be found more quickly based on the value of one
or more attributes (keys) in a tuple
Secondary Index
• To create a simple index on column A of table T, make a
list of pairs of the form
(attribute A value, tuple rowid)
for each tuple in T
– example: secondary index for the SSN attribute
SSN ROWIDs (RID)
111-11-1111 AAAAqYAABAAAEPvAAH
222-22-2222 AAAAqYAABAAAEPvAAD
333-33-3333 AAAAqYAABAAAEPvAAG
. . . .
• This index is large and stored on the disk

8
Secondary Index
• Suppose a disk page can contain 200 index entries
from a secondary index
• To store a secondary index for a relation with 1
million tuples assuming no duplicate values requires:
1,000,000 / 200 = 5,000 disk pages
• To find a particular Person tuple in the SSN index
given his or her SSN, you must on average scan half
of the index (5,000 / 2 = 2500 disk accesses)
• If 20 tuples of the Person relation fit on a page, then
sequential scan of the relation itself needs to read on
average half the relation (50,000 / 2 = 25,000 disk
accesses)
• In this case, the secondary index helps a lot
Efficiency
• Need to organize the index information in a
way that makes it efficient to access and
search
– scanning the index from the beginning is
not good enough
• Sorting the secondary index helps, but is not
sufficient
• Solution 1: build a tree index
• Solution 2: hash the index

9
Tree Indices
• Want to minimize number of disk accesses.
– each tree node requires a disk access
– therefore, trees that are broad and shallow
are preferred over trees that are narrow
and deep
• Balanced binary search tree, AVL tree, etc.
that are useful in main memory are too narrow
and deep for secondary storage.
• Need an m-way tree where m is large.
– also need a tree that is balanced
B+-Tree
• A B+ -Tree of order d is a tree in which:
– each node has between d and 2d key values
– the keys values within a node are ordered
– each key in a node has a pointer immediately before
and after it
• leaf nodes: pointer following a key is pointer to
record with that key
• interior nodes: pointers point to other nodes in the
tree
– the length of the path from root to leaf is the same for
every leaf (balanced)
– the root may have fewer keys and pointers

10
Example B+ -Tree
66 69 71 762 7 11 15 22 30 41 53 54 63 78 84 93
53
11 30 66 78
B+-Tree of order 2
each node can hold up to four keys
Searching in B+-Trees
Search(T, K) /* searching for tuple with key value K in tree T */
{
if T is a non-leaf node
search for leftmost key K’ in node T such that K < K’
if such a K’ exists
ptr = pointer in T immediately before K’
return the result of Search(ptr, K)
if no such K’ exists
ptr = rightmost pointer in node T
return the result of Search(ptr, K)
else if T is a leaf node
search for K in T
if found, return the pointer following K
else return NULL /* K not in tree T */
}

11
Insert Algorithm
• To insert a new tuple with key K and address rowid:
– use a modified Search algorithm to look for the leaf
node into which key K should be inserted
– insert key K followed by address rowid into the
proper place in this leaf node to maintain order and
rebalance the tree if necessary
• Rebalancing the tree
– if the leaf node has room for K and rowid, then no
rebalancing is needed
– if the leaf node has no room for K and rowid, then it
is necessary to create a new node and rebalance
the tree
Rebalancing Algorithm
• Assume that K and rowid are to be inserted into leaf node L,
but L has no more room.
– create a new empty node
– put K and rowid in their proper place among the entries in
L to maintain the key sequence order -- there are 2d+1
keys in this sequence
– leave the first d keys with their rowids in node L and move
the final d+1 keys with their rowids to the new node
– copy the middle key K’ from the original sequence into the
parent node followed by a pointer to the new node
• put them immediately after the pointer to node L in the
parent node
– apply this algorithm recursively up the tree as needed

12
Insert Example
Insert record with key 57
B+–Tree of order 2
66 69 71 762 7 11 15 22 30 41 53 54 63 78 84
51
11 30 66 78
57
66 69 71 762 7 11 15 22 30 41 53 54 63 78 84
51
11 30 66 78
66 69 71 762 7 11 15 22 30 41 53 54 57 63 78 84
53
11 30 66 78
Another Insert Example
66 69 71 762 7 11 15 22 30 41 53 54 57 63 78 84
51
11 30 66 78
Insert record with key 65
B+–Tree of order 2
66 69 71 762 7 11 15 22 30 41 53 54 57 63 78 84
51
11 30 66 78
65
66 69 71 762 7 11 15 22 30 41 53 54 57 63 78 84
51
11 30 66 78
66 69 71 762 7 11 15 22 30 41 53 54 78 84
51
11 30 66 78
57 63 65
57
66 69 71 762 7 11 15 22 30 41 53 54 78 84
53
11 30 57 66 78
57 63 65

13
Insertion Algorithm (1)
Insert (T, K, rowid, child)
/* insert new tuple with key K and address rowid into tree T */
/* child is NULL initially */
{
/* handle an interior node of the B+-Tree */
find j such that Kj ≤ K ≤ Kj+1 for keys K1, …, Kn in T
ptr = pointer between Kj and Kj+1
Insert (ptr, K, rowid, child)
if child is NULL then return
/* must insert key and child pointer into T */
if T has space for another key and pointer
put child.key and child.ptr into T at proper place
child = NULL
return
Insert Algorithm (2)
else /* must split node T */
construct sequence of keys and pointers from T with
child.key and child.ptr inserted at proper place
first d keys and d+1 pointers from sequence stay in T
last d keys and d+1 pointers from sequence move to
a new node N
child.key = middle key from sequence
child.ptr = pointer to N
if T is root
create new node containing pointer to T,
child.key, and child.ptr
make this node the new root node of the B+-Tree
return
/* handle leaf node of the B+-Tree */
If T is a leaf node
if T has space for another key and rowid
put K and rowid into T at proper place
return

14
Insert Algorithm (3)
else /* must split leaf node T */
construct sequence of keys and pointers from T with
K and rowid inserted at proper place
first d keys and d+1 pointers from sequence stay in T
last d+1 keys and d+2 pointers from sequence move to
a new node N
child.key = first key in new node N
child.ptr = pointer to N
if T was root
create new node containing pointer to T,
child.key, and child.ptr
make this node the new root node of the B+-Tree
return
}
Deletion
• Assume that a tuple with key K and address rowid is
to be deleted from leaf node L. There is a problem if
after removing K and rowid from L it has fewer than d
keys remaining. To fix this:
– if a neighbor node has at least d+1 keys, then
evenly redistribute the keys and rowids with the
neighbor node and adjust the separator key in the
parent node
– otherwise, combine node L with a neighbor node
and discard the empty node
• the parent node now needs one less key and
node pointer, so recursively apply this algorithm
up the tree until all nodes have enough keys
and pointers

15
Deletion Example
Redistribute between
the second and
third leaf nodes.
66 69 71 762 7 53 54 63 78 84 93
53
11 30 66 78
B+-Tree of order 2
30
Delete key 30
30 4111 15 22 4122 4111 15
22
Another Deletion Example
Cannot redistribute,
so combine the left
two leaf nodes
66 69 71 762 7 11 15 30 41 53 54 63 78 84 93
53
11 30 66 78
B+-Tree of order 2
Delete 7 from the B-Tree
2 11 15

16
Another Example Continued
B+-Tree of order 2
Delete 7 from the B+-Tree
66 69 71 7612 15 30 41 53 54 63 78 84 93
51
30 66 78
2 11 15
Node not valid,
too few pointers
Cannot redistribute,
so combine with
sibling node
66 69 71 7612 15 30 41 53 54 63 78 84 93
51
30 66 78
2 11 15 66 69 71 7612 15 30 41 53 54 63 78 84 93
30 51 66 78
2 11 15 66 69 71 7612 15 30 41 53 54 63 78 84 93
30 53 66 78
2 11 15
Deletion Algorithm (1)
Delete (Parent, T, K, oldchild)
/* delete key K from Tree T */
/* Parent is parent node for T, initially NULL */
/* oldchild is discarded child node, initially NULL */
{
/* handle an interior node of the B+-Tree */
find j such that Kj ≤ K ≤ Kj+1 for keys K1, …, Kn in T
ptr = pointer between Kj and Kj+1
Delete (T, ptr, K, oldchild)
if oldchild is NULL then return
/* must handle discarded child node of T */
remove oldchild and adjacent key from T
if T still has enough keys and pointers
oldchild = NULL
return

17
/* must fix node T */
get a sibling node S of T using Parent
if S has entry keys /* redistribute S and T */
redistribute keys and adjacent pointers evenly
between S and T
K’ = middle unused key from the redistribution
replace the key in Parent between the pointers
to S and T with K’
oldchild = NULL
return
else /* merge S and T */
R = S or T, whichever is to the right of the other
oldchild = R
copy key from Parent node that is immediately
before R to the end of the node on the left
move all keys and adjacent pointers from R
to the node on the left
discard node R
return
/* handle leaf node of the B+-Tree */
if T is a leaf node
if T has extra keys
remove key K from T
oldchild = NULL
return
/* must fix node T */
get a sibling node S of T using Parent
if S has entry keys /* redistribute S and T */
redistribute keys and adjacent pointers evenly
between S and T
K’ = first key from node S or T, whichever is to
the right of the other
replace the key in Parent between the pointers
to S and T with K’
oldchild = NULL
return

18
else /* merge S and T */
R = S or T, whichever is to the right of the other
oldchild = R
move all keys and adjacent rowids from R
to the node on the left
discard node R
return
}
Analysis of B+-Trees
• Every access to a node is an access to disk and hence is
expensive.
• Analysis of Find:
– if there are n tuples in the tree, the height of the tree, h,
is bounded by h

ceil(log
d
(n))
– example: d = 50, tree contains 1 million records, then h

4
• Analysis of Insert and Delete
– finding the relevant node required h accesses
– rebalancing required O(h) accesses
– therefore, the total is O(log
d
n) accesses

19
B+-tree
• The create index command creates a B+-tree index
CREATE INDEX age_idx ON people(age)
TABLESPACE file1
PCTFREE 70
• PCTFREE defines how full each node should be
• Optimal operation is usually with nodes about 70% full
• To reduce disk accesses for sequential processing,
pointers are added to the leaf nodes that point to the
previous and next leaf nodes
A B+-Tree Example
• Givens:
– disk page has capacity of 4K bytes
– each rowid takes 6 bytes and each key value takes 2
bytes
– each node is 70% full
– need to store 1 million tuples
• Leaf node capacity
– each (key value, rowid) pair takes 8 bytes
– disk page capacity is 4K, so (4*1024)/8 = 512 (key value,
rowid) pairs per leaf page
• in reality there are extra headers and pointers that we
will ignore
• Hence, the degree for the tree is about 256

20
Example Continued
• If all pages are 70% full, each page has about
512*0.7 = 359 entries
• To store 1 million tuples, requires
1,000,000 / 359 = 2786 pages at the leaf
level
2789 / 359 = 8 pages at next level up
1 root page pointing to those 8 pages
• Hence, we have a B+-tree with 3 levels, and a
total of 2786+8+1 = 2795 disk pages
Duplicate Key Values
• Duplicate key values in a B+-tree can be handled.
– (key, rowid) pairs for same key value can span
multiple index nodes
• Search algorithm needs to be changed
– find leftmost entry at the leaf level for the searched
item, then scan the index from left to right following
leaf level pointers
• The insertion and deletion algorithms also require small
changes
– they are more costly and hence not always
implemented in practice

21
Bitmap Index
• For some attribute x with possible values A,B and C:
– create a list of all tuples in the relation and store their
rowids at some known location
– build an index for each value, for example for value A
• the bitmap contains a “1” at location k if tuple k has
value “A” for this attribute
• otherwise it contains a “0”
– indices with a lot of “0”s are called sparse and can be
compressed
s2 15 3 s9 . . .
A
s5 15 2 s8 . . .
A
Bitmap Example
s1 10 6 s7 . . .
A
s4 10 3 s6 . . .
A
s3 15 4 s10 . . .
A
Tuples . . .
. . .Tuple
List
1 0 0 1 0 . . .
0 1 1 0 1 . . .
Bitmap for A=10
Bitmap for A=15

22
Querying with Bitmap Index
• Suppose have bitmap indices on attributes x and y
– Find if x=“A” or x=“B”, take the bitmaps for both
values and do a logical or
– Find if x=“A” and y<>“B”, compute the logical inverse
of bitmap for y=“B” and then do a logical and with
bitmap for x=“A”
• Bitmaps depend on the actual row ids of tuples
• If a tuple is deleted, its location can be removed or
swapped by another tuple (costly if the index is
compressed)
• Too many updates or attributes with too many values
lead to bitmaps that are not cost effective
Row directory Tuple 1, Tuple 2, … , Tuple 10
B+-tree index
on attributes A1,…,Ak
Primary access methods
Heap: tuples are placed in the order
they are inserted
Cluster: tuples with the same values
for attributes A1,…,Ak are placed
close to each other on disk
Hash: tuples with the same hash value
are placed close to each other on disk
Secondary access
methods
The primary access
method can be anything.
Additional indexes are
created with entries that
point to actual tuples
Row directory Tuple 11, Tuple 12, … , Tuple 20

23
Clusters
• A cluster is a primary access method, it changes
the placement of tuples on disk
CREATE CLUSTER personnel
(department_number integer)
SIZE 512
STORAGE (INITIAL 100K NEXT 50K)
• In ORACLE, a cluster can be generated for many
tables containing the same set of attributes
• All tuples in different tables from the same cluster
will be placed closed to each other on disk (i.e. on
the same page and on consecutive pages)
Adding tables to a cluster

24
Clusters
• Each table may belong to at most one cluster.
• Suppose we retrieve an employee tuple with deptno=10.
We find a page with this employee and read it into memory.
• If there are 20 employees in the department 10, then
chances are that all these employees are on the same
page.
• To find all employees in department 10 through 20, we can
simply read the necessary pages.
• A cluster is not an index, but we can also create a B+-tree
index on a cluster:
CREATE INDEX idx_personnel ON CLUSTER personnel;
Hashing
• Hashing is another index method that changes the way tuples are
placed on disk
• A hash index on attribute A allocates an initial set of pages to store
the relation:
1
2
3
n
.
.
New tuple T
with key A
Hash function h
ranges between
1 and n
h(T. A)If multiple tuples map to the same
location/page, this is called a
collision. These tuples are placed
in an overflow page.

25
Hashing
• The number of key values is given by HASHKEYS
• Hashing is useful for finding a tuple with a given key value
• Hashing is not as useful for ranges or key values or for
sequential processing of tuples in key order
• In the best case, a tuple is found with one disk access
• In the average case, expect 1.2 disk accesses or more
(because of overflow pages)
Extensible Hashing
• Assume that we originally allocate 2n pages for the hash
• Distribute tuples according to hash function mod 2
– hash the key to produce a bit string and then use the least
significant bit
• If a disk page becomes full, double the directory size instead of
creating overflow buckets
Page 0
Page 1
Hash directory
tuples
0
1

26
Extensible Hashing
• Insert into a full Page 1 – double the directory size
The full page is split into two. Its
contents are rehashed between
the original page and the new page,
using one additional bit from the string
produced by the hash function
Just created a new directory entry for
Page 3. Since Page 1 is not full, this
directory entry points to Page 1. The
contents of Page 1 will be rehashed
with Page 3 when Page 1 becomes full
rather than creating a new page.
Page 0
Page 1
Page 2
Page 3
Hash directory
00
01
10
11
New Page
Extensible Hashing
• As the hash directory size grows it must be stored on
the disk
• At most two disk accesses are needed to retrieve any
tuple
– this is a better upper bound than for B+-Trees
• However, extensible hashing is not as good as B+-
Trees for range queries and sequential processing
where you want to process all the tuples of a relation
• Consequently, B+-Trees are used more frequently
than Extensible Hashing

[Www.pkbulk.blogspot.com]dbms12

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (8)

Similar to [Www.pkbulk.blogspot.com]dbms12

Similar to [Www.pkbulk.blogspot.com]dbms12 (20)

More from AnusAhmad

More from AnusAhmad (18)

Recently uploaded

Recently uploaded (20)

[Www.pkbulk.blogspot.com]dbms12