2. Data Structure vs. File Structure
Data Structure: How to arrange data in memory
File Structure: How to arrange data on disk and/or any other secondary storage
Database and Database Management System (DBMS)
Users do NOT have to care about how data is stored in a file. The DBMS handles the details.
Users can access the database with SQL (Structured Query Language), either through interactive commands or from within a program (embedded SQL)
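As a minimal sketch of using SQL from a program, the following uses Python's built-in sqlite3 module with an in-memory database. The table name `student` and its rows are made up for illustration. The program only states what data it wants and never deals with how the records are laid out in storage.

```python
import sqlite3

# An in-memory database: the DBMS decides how the data is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO student (id, name) VALUES (?, ?)",
    [(1, "Ada"), (2, "Alan")],  # hypothetical sample rows
)

# The query says WHAT is wanted, not HOW to find it.
rows = conn.execute("SELECT name FROM student ORDER BY id").fetchall()
names = [row[0] for row in rows]
print(names)
conn.close()
```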
3. File Types
Sequential file: one accessed in a serial manner from
beginning to end. E.g. audio, video, text, programs.
Text file: sequential file in which each logical record
is a single character.
ASCII: 1 byte/char
Unicode: 2 bytes/char in UCS-2 (UTF-16 uses 2 or 4 bytes/char, UTF-8 uses 1 to 4)
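The storage cost per character can be checked directly by encoding a short string. This is a small sketch; the sample strings are arbitrary.

```python
text = "Hi"

ascii_bytes = text.encode("ascii")      # 1 byte per character
utf16_bytes = text.encode("utf-16-le")  # 2 bytes per character for these letters
euro_utf8 = "€".encode("utf-8")         # a non-ASCII character takes several bytes in UTF-8

print(len(ascii_bytes), len(utf16_bytes), len(euro_utf8))
```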
4. Sequential Files
Sequential file: A file whose contents can only be
read in order
Reader must be able to detect end-of-file (EOF)
Data can be stored in logical records, sorted by a key field
Greatly increases the speed of batch updates
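Why sorting by key speeds up batch updates can be sketched as a single-pass merge: if the master file and the update file are both sorted by key, each is read once from beginning to end. The function name `batch_update` and the (key, value) record layout are assumptions for illustration, with Python lists standing in for the files.

```python
def batch_update(master, updates):
    """Apply key-sorted updates to key-sorted master records in one pass."""
    result = []
    update_iter = iter(updates)
    update = next(update_iter, None)  # None plays the role of EOF

    for key, value in master:
        # Skip updates whose key has no matching master record.
        while update is not None and update[0] < key:
            update = next(update_iter, None)
        # Apply every update that matches the current master record.
        while update is not None and update[0] == key:
            value = update[1]
            update = next(update_iter, None)
        result.append((key, value))
    return result


master = [(1, "a"), (3, "b"), (5, "c")]
updates = [(3, "B"), (5, "C")]
updated = batch_update(master, updates)
print(updated)
```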
5. Text Files
Simple file structure.
Extensible to more complex file structures using markup languages (XHTML, HTML).
XHTML and HTML control how the file is displayed on a monitor.
XML is a standard for defining markup languages.
6. Converting data from two’s complement notation into ASCII for storage in a text file
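A minimal sketch of this conversion, assuming the two's complement value arrives as a string of bits (the function name is hypothetical): interpret the pattern as a signed integer, then write its decimal digits as ASCII characters.

```python
def twos_complement_to_ascii(bits: str) -> bytes:
    """Turn a two's complement bit pattern into ASCII decimal text."""
    width = len(bits)
    value = int(bits, 2)        # read the pattern as an unsigned number
    if bits[0] == "1":          # a leading 1 means the value is negative
        value -= 1 << width     # subtract 2**width to recover the sign
    return str(value).encode("ascii")


print(twos_complement_to_ascii("00011001"))
print(twos_complement_to_ascii("11100111"))
```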
14. Hashing
In hashing the index file is replaced by a hash
function.
The storage space is divided into buckets.
Each record has a key field. Each record is stored in the
bucket corresponding to the hash of its key.
A hash function computes a bucket number for each
key value.
Advantage: no index table needed.
Disadvantages:
i) the hash function needs careful design;
ii) performance is unpredictable
15. Terminology
Bucket: section of the data storage area.
Key: identifier for a block of information.
Hash function: takes as input a key and outputs a
bucket number.
Collision: two keys yield the same bucket number.
16. Hash Functions
Hash Function Requirements
Easily and quickly computed.
Values evenly spread over the bucket numbers.
What can go wrong: if the bucket number is computed from the 1st
and 3rd characters of a name, all of these names collide:
Brown, Brook, Broom, Broadhead,
Biot, Bloom,
…
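The failure can be reproduced in a few lines. This sketch uses a hypothetical helper, `first_third`, that extracts the two characters the hash would be based on; every name in the list yields the same pair.

```python
def first_third(name):
    """Return the 1st and 3rd characters, the only input this poor hash uses."""
    return name[0] + name[2]


names = ["Brown", "Brook", "Broom", "Broadhead", "Biot", "Bloom"]
distinct_inputs = {first_third(name) for name in names}
print(distinct_inputs)  # a single value means every name lands in one bucket
```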
Examples of Hash Functions
Mid square: compute key × key and set bucket number = middle digits.
Extraction: select digits from certain positions within the key.
Division: divide the key by the number of buckets and use the remainder.
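The three schemes can be sketched for integer keys as follows. The function names, the number of middle digits, and the extracted positions are illustrative choices, not part of the slides. In practice the mid-square and extraction results would still have to be reduced to the valid range of bucket numbers.

```python
def mid_square(key, digits=2):
    """Square the key and keep `digits` digits from the middle."""
    squared = str(key * key)
    start = (len(squared) - digits) // 2
    return int(squared[start:start + digits])


def extraction(key, positions=(0, 2)):
    """Build a number from the digits at the chosen positions of the key."""
    text = str(key)
    return int("".join(text[p] for p in positions))


def division(key, num_buckets):
    """Use the remainder after dividing the key by the number of buckets."""
    return key % num_buckets


print(mid_square(1234), extraction(1234), division(1234, 41))
```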
18. The rudiments of a hashing system, in which each bucket holds those records that hash to that bucket number
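A minimal sketch of such a system, assuming integer keys and the division hash function. The class name `BucketStore` and its methods are hypothetical. Each bucket is a list, so records whose keys collide simply share a bucket.

```python
class BucketStore:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket_number(self, key):
        # Division hash: remainder after dividing by the number of buckets.
        return key % len(self.buckets)

    def insert(self, key, record):
        self.buckets[self._bucket_number(key)].append((key, record))

    def find(self, key):
        # Only the one bucket the key hashes to has to be searched.
        for stored_key, record in self.buckets[self._bucket_number(key)]:
            if stored_key == key:
                return record
        return None


store = BucketStore(num_buckets=8)
store.insert(25, "record A")
store.insert(33, "record B")  # 25 and 33 both hash to bucket 1: a collision
print(store.find(33))
```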
19. Collisions in Hashing
Collision: The case of two keys hashing to the same bucket
Clustering problem: A poorly designed hash function can distribute keys unevenly across buckets
Collisions also become a problem when there aren’t enough buckets: the probability of a collision rises sharply as the load factor (% of buckets filled) approaches 75%
Solution: somewhere between 50% and 75% load factor, increase the number of buckets and rehash all data
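The rehashing remedy can be sketched as below, assuming buckets are lists of (key, record) pairs with integer keys and a division hash. The function name, the 75% threshold default, and the choice to double the bucket count are illustrative assumptions.

```python
def rehash_if_needed(buckets, max_load=0.75):
    """Double the number of buckets once the load factor reaches max_load."""
    filled = sum(1 for bucket in buckets if bucket)
    if filled / len(buckets) < max_load:
        return buckets  # still enough empty buckets

    new_buckets = [[] for _ in range(2 * len(buckets))]
    for bucket in buckets:
        for key, record in bucket:
            # Every record is re-hashed against the new bucket count.
            new_buckets[key % len(new_buckets)].append((key, record))
    return new_buckets


crowded = [[(0, "a")], [(1, "b")], [(2, "c")], []]  # 3 of 4 buckets filled
grown = rehash_if_needed(crowded)
print(len(grown))
```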
24. Information Required on a Hard Drive to Load an OS
Startup BIOS (POST, Load MBR)
Master Boot Record (MBR)
Master Boot Program
Partition Table (16 bytes * 4)
OS Boot Record (Boot Sector)
Loads the first program file of the OS
Boot Loader Program
Begins process of loading OS into memory
25. How Data Is Logically Stored on a Floppy Disk
All floppy and hard disk drives are divided into
tracks and sectors
Tracks are concentric circles on a disk
Sector
512 bytes on floppy disks and traditional hard drives
Physical organization of a disk
BIOS manages disk as sectors
Cluster (file allocation unit)
Group of sectors
Logical organization of a disk
OS views disk as a list of clusters
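Because the OS allocates whole clusters, a file always occupies a whole number of them and the last cluster is usually only partly used. A small sketch under the assumption of 512-byte sectors; the function names are hypothetical.

```python
SECTOR_SIZE = 512  # bytes per sector, as assumed here


def clusters_needed(file_size, sectors_per_cluster):
    """Number of clusters the OS must allocate for a file of file_size bytes."""
    cluster_size = SECTOR_SIZE * sectors_per_cluster
    return -(-file_size // cluster_size)  # ceiling division


def slack_bytes(file_size, sectors_per_cluster):
    """Allocated but unused bytes in the file's last cluster."""
    cluster_size = SECTOR_SIZE * sectors_per_cluster
    return clusters_needed(file_size, sectors_per_cluster) * cluster_size - file_size


print(clusters_needed(1000, 1), clusters_needed(1000, 8), slack_bytes(1000, 8))
```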
26. The Boot Record
Track 0, sector 1 of a floppy disk
Contains basic information about how the disk is
organized
Includes bootstrap program, which can be
used to boot from the disk
Uniform layout and content of boot record allows
any version of DOS or Windows to read any DOS
or Windows disk
27. The File Allocation Table (FAT)
Lists the location of files on disk in a one-column
table
Floppy disk FAT is 12 bits wide, called FAT12
Each entry describes how a cluster on the disk is
used
A bad cluster on the disk will be marked in the
FAT
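How the one-column table locates a file can be sketched by following a chain of entries: the entry for each cluster holds the number of the file's next cluster. The sketch assumes the FAT has already been decoded into a Python list of integers and uses the FAT12 conventions that values 0xFF8 to 0xFFF mark the end of a file and 0xFF7 marks a bad cluster. The sample table is made up.

```python
END_OF_CHAIN_MIN = 0xFF8  # FAT12: 0xFF8..0xFFF mean "last cluster of the file"
BAD_CLUSTER = 0xFF7       # FAT12 marker for a bad cluster


def cluster_chain(fat, first_cluster):
    """Return the list of clusters a file occupies, in order."""
    chain = []
    cluster = first_cluster
    while cluster < END_OF_CHAIN_MIN:
        if cluster == BAD_CLUSTER or cluster < 2:
            raise ValueError(f"corrupt chain at cluster {cluster:#x}")
        chain.append(cluster)
        cluster = fat[cluster]  # the entry names the next cluster
    return chain


# Hypothetical table: entries 0 and 1 are reserved; a file starts at cluster 2.
fat = [0xFF0, 0xFFF, 3, 5, 0, 0xFFF]
print(cluster_chain(fat, 2))
```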
28. The Root Directory
Lists all the files assigned to this table
Contains a fixed number of entries
Some items included are:
Filename and extension
Time and date of creation or last update
File attributes
First cluster number
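The items listed above can be read out of one directory entry as sketched here. The sketch assumes the commonly documented 32-byte FAT12/FAT16 entry layout (8-byte name, 3-byte extension, attribute byte, then last-update time, date, first cluster and size starting at offset 22). The function name and the sample entry are hypothetical.

```python
import struct


def parse_dir_entry(entry: bytes) -> dict:
    """Decode one 32-byte FAT12/FAT16 directory entry (assumed layout)."""
    name = entry[0:8].decode("ascii").rstrip()
    ext = entry[8:11].decode("ascii").rstrip()
    attributes = entry[11]
    # Little-endian: time, date, first cluster (2 bytes each), size (4 bytes).
    time, date, first_cluster, size = struct.unpack_from("<HHHI", entry, 22)
    return {
        "filename": f"{name}.{ext}" if ext else name,
        "attributes": attributes,
        "updated": (1980 + (date >> 9), (date >> 5) & 0x0F, date & 0x1F,
                    time >> 11, (time >> 5) & 0x3F, (time & 0x1F) * 2),
        "first_cluster": first_cluster,
        "size": size,
    }


# Build a sample entry: README.TXT, archive attribute, 2001-03-15 12:30:10.
packed_time = (12 << 11) | (30 << 5) | (10 // 2)
packed_date = ((2001 - 1980) << 9) | (3 << 5) | 15
sample = (b"README  " + b"TXT" + bytes([0x20]) + bytes(10)
          + struct.pack("<HHHI", packed_time, packed_date, 2, 1000))
info = parse_dir_entry(sample)
print(info)
```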
29. How a Hard Drive is Logically Organized to Hold Data
Low-level format
Creates tracks and sectors, done at factory
Partition the hard drive (FDISK.EXE)
Creates partition table at the beginning of drive
High-level format
Done by OS for each logical drive
Master Boot Record (MBR) is the first 512 bytes
of a hard drive
Master boot program (446 bytes) calls boot program to
load OS
Partition table
Description, Location, Size
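The MBR layout can be sketched as a small parser: 446 bytes of boot program, then four 16-byte partition entries, then a 2-byte signature. The signature value (0x55, 0xAA) and the field offsets inside each entry follow the classic MBR format and are not stated on the slide. The function name and the sample sector are hypothetical.

```python
import struct


def parse_mbr(sector: bytes) -> list:
    """Return the non-empty partition table entries of a 512-byte MBR."""
    if len(sector) != 512 or sector[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR sector")

    partitions = []
    for i in range(4):
        entry = sector[446 + 16 * i: 446 + 16 * (i + 1)]
        partition_type = entry[4]                                    # description
        start_lba, num_sectors = struct.unpack_from("<II", entry, 8)  # location, size
        if partition_type != 0:  # type 0 means the slot is unused
            partitions.append({
                "bootable": entry[0] == 0x80,
                "type": partition_type,
                "start_lba": start_lba,
                "sectors": num_sectors,
            })
    return partitions


# Build a sample sector with one bootable partition of type 0x06.
sector = bytearray(512)
sector[446:462] = bytes([0x80, 0, 0, 0, 0x06, 0, 0, 0]) + struct.pack("<II", 63, 1000)
sector[510:512] = b"\x55\xaa"
table = parse_mbr(bytes(sector))
print(table)
```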
31. FAT16
Supported by DOS and all versions of Windows
Uses 16 bits for each cluster entry
As the size of the logical drive increases, FAT16
cluster size increases dramatically
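Why cluster size grows with drive size follows from the fixed number of cluster numbers available. The sketch below computes a lower bound on the cluster size, assuming cluster sizes are power-of-two multiples of a 512-byte sector and ignoring reserved entry values. Real formatting tools may choose larger clusters than this bound.

```python
def min_cluster_size(volume_bytes, cluster_number_bits, sector_size=512):
    """Smallest power-of-two cluster size that lets the FAT cover the volume."""
    max_clusters = 2 ** cluster_number_bits
    size = sector_size
    while size * max_clusters < volume_bytes:
        size *= 2  # too many clusters to number: make each cluster bigger
    return size


MB = 1024 ** 2
GB = 1024 ** 3
print(min_cluster_size(256 * MB, 16))  # small drive, 16-bit cluster numbers
print(min_cluster_size(2 * GB, 16))    # larger drive, 16-bit cluster numbers
print(min_cluster_size(2 * GB, 28))    # same drive, 28-bit cluster numbers
```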
32. FAT32
Became available with Windows 95 OSR2
Uses 32 bits per FAT entry, although only 28 bits
are used to hold cluster numbers
More efficient than FAT16 in terms of cluster size
33. NTFS
Supported by Windows NT/2000/XP
Provides greater security
Uses a database called the master file table
(MFT) to locate files and directories
Supports large hard drives
38. Schemas
Schema: A description of the structure of an entire
database, used by database software to maintain the
database
Subschema: A description of only that portion of
the database pertinent to a particular user’s needs,
used to prevent sensitive data from being accessed
by unauthorized personnel
39. Database Management Systems
Database Management System (DBMS): A
software layer that manipulates a database in
response to requests from applications
Distributed Database: A database stored on
multiple machines
DBMS will mask this organizational detail from its users
Data independence: The ability to change the
organization of a database without changing the
application software that uses it
41. Relational Database Model
Relation: A rectangular table
Attribute: A column in the table
Tuple: A row in the table
Relational Design
Avoid multiple concepts within one relation
Can lead to redundant data
Deleting a tuple could also delete necessary but unrelated
information
42. Improving a Relational Design
Decomposition: Dividing the columns of a
relation into two or more relations, duplicating those
columns necessary to maintain relationships
Lossless or nonloss decomposition: A “correct”
decomposition that does not lose any information
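A decomposition can be checked for losslessness by joining the pieces back together and comparing with the original relation. In this sketch relations are lists of dictionaries; the helper names `project` and `natural_join` and the employee data are made up for illustration. The original relation mixes two concepts (employees and departments), so the department phone is stored redundantly.

```python
def project(relation, attributes):
    """Keep only the given columns, dropping duplicate tuples."""
    result = []
    for row in relation:
        sub = {a: row[a] for a in attributes}
        if sub not in result:
            result.append(sub)
    return result


def natural_join(left, right):
    """Combine tuples that agree on all attributes the relations share."""
    joined = []
    for l_row in left:
        for r_row in right:
            common = set(l_row) & set(r_row)
            if all(l_row[a] == r_row[a] for a in common):
                joined.append({**l_row, **r_row})
    return joined


original = [
    {"emp": "Ann", "dept": "Sales", "dept_phone": "x100"},
    {"emp": "Bob", "dept": "Sales", "dept_phone": "x100"},  # phone repeated
    {"emp": "Cy", "dept": "Lab", "dept_phone": "x200"},
]

# Decompose, duplicating the "dept" column to keep the relationship.
emp_dept = project(original, ["emp", "dept"])
dept_info = project(original, ["dept", "dept_phone"])

rejoined = natural_join(emp_dept, dept_info)
print(rejoined == original)
```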
55. Maintaining Database Integrity (1/2)
Transaction: A sequence of operations that must
all happen together
Example: transferring money between bank accounts
Transaction log: A non-volatile record of each
transaction’s activities, built before the transaction
is allowed to execute
Commit point: The point at which a transaction has
been recorded in the log
Roll-back: The process of undoing a transaction
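The bank transfer example can be sketched with Python's built-in sqlite3 module. The `account` table, its CHECK constraint, and the `transfer` function are assumptions for illustration. Using the connection as a context manager commits the transaction if every statement succeeds and rolls it back if one fails, so the two updates happen together or not at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts as one transaction."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False  # the credit to dst was rolled back as well


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account"))


ok = transfer(conn, "alice", "bob", 30)       # succeeds and commits
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice: rolled back
final = balances(conn)
print(ok, failed, final)
```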
56. Maintaining Database Integrity (2/2)
Simultaneous access problems
Incorrect summary problem
Lost update problem
Locking = preventing others from accessing data
being used by a transaction
Shared lock: used when reading data
Exclusive lock: used when altering data
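The lost update problem and the effect of an exclusive lock can be sketched as follows. The first function replays a bad interleaving by hand so the outcome is deterministic. The second uses threading.Lock as a stand-in for an exclusive lock. The `Account` class and the amounts are made up for illustration.

```python
import threading


class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()  # stands in for an exclusive lock


def lost_update(account):
    """Two transactions read the same balance before either one writes."""
    read_a = account.balance
    read_b = account.balance
    account.balance = read_a + 100  # transaction A deposits 100
    account.balance = read_b - 30   # transaction B overwrites A's result


def locked_change(account, amount):
    """Read and write while holding the lock, so no one can interleave."""
    with account.lock:
        current = account.balance
        account.balance = current + amount


unsafe = Account(1000)
lost_update(unsafe)

safe = Account(1000)
threads = [threading.Thread(target=locked_change, args=(safe, amount))
           for amount in (100, -30)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(unsafe.balance, safe.balance)
```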
57. Data Mining
Data Mining: The area of computer science that
deals with discovering patterns in collections of data
Data warehouse: A static data collection to be
mined
Data cube: Data presented from many perspectives to
enable mining
Data Mining Strategies
Class description
Class discrimination
Cluster analysis
Association analysis
Outlier analysis
Sequential pattern analysis
58. Social Impact of Database Technology
Problems
Massive amounts of personal data are being collected
Often without knowledge or meaningful consent of affected
people
Data merging produces new, more invasive information
Errors are widely disseminated and hard to correct
Remedies
Existing legal remedies often difficult to apply
Negative publicity may be more effective