# Unit 4

## Section 1: Physical Database Design

* Transform the logical database model into a physical database model.
* The physical model is the last stage of database design.
* It is a detailed specification of the database structure, including definitions of:
  * physical names
  * formats
  * locations of physical records
  * their fields

Read: Chapter 5

* the physical database design process
* designing fields
* denormalizing and partitioning data

Important terms:

* field
* data type
* data integrity
* denormalization
* horizontal partitioning
* vertical partitioning

### The physical database design process

* Goal: data processing efficiency.

#### Critical decisions

* Choose the storage format (data type) for each attribute.
* Provide the DBMS with hints on how to group attributes.
* Provide the DBMS with hints on how to arrange similar data.
* Select structures for storing and connecting files efficiently (indexes, database architecture).
* Prepare strategies for handling queries; optimize for performance.

#### Usage analysis

* Add notation to the EER diagram that represents the final set of normalized relations.
* Record data volumes.
* Record access frequencies.

#### Designing Fields

> Field: the smallest unit of application data recognized by system software.

Decisions to make for each field:

* the type of data
* data integrity controls
* how missing values for the field are handled

> Data type: a detailed coding scheme recognized by system software for representing organizational data.

The data type chosen affects:

* the space needed to store the data
* the speed required to access the data

Objectives in selecting a data type:

1. represent all possible values
2. improve data integrity
3. support all data manipulations
4. minimize storage space

See [PostgreSQL Data Types](https://www.postgresql.org/docs/current/datatype.html):

* numeric
* monetary
* character
* binary
* date/time
* boolean
* enumerated
* geometric
* network address
* bit string
* text search
* UUID
* XML
* JSON
* arrays
* composite
* range
* domain
* object identifier

#### Data Integrity

* default value
* range control
* null value control
* referential integrity control

#### Handling Missing Data

* use a default value
* do not allow null values
* substitute an estimate of the missing data
* track missing data so it can be resolved manually
* perform sensitivity testing to determine whether the missing data matter

### Denormalizing

Normalized relations solve maintenance anomalies and minimize redundancy. However, they may not yield efficient data processing when the required data span many relations: joins are costly when related data are not stored near each other.

> Denormalization: the process of transforming normalized relations into non-normalized physical record specifications.

Opportunities for denormalization:

1. Two entities with a one-to-one relationship.
2. A many-to-many relationship (associative entity) with nonkey attributes: combining relations avoids joining three files to extract data from the two basic entities.
3. Reference data.

Dangers of denormalization:

* increased chance of errors and inconsistencies
* optimization for one type of processing may hurt a different type of processing
* greater storage space requirements

### Partitioning

Horizontal partitioning

> Horizontal partitioning: distribution of the rows of a logical relation into several separate tables.

Partition key:

* a single column
* often a date, since dates are a common qualifier in queries

Horizontal partitioning makes table maintenance easier, because fragmenting and rebuilding can be isolated to single partitions as storage space needs to be reorganized. File-level security can be applied to prohibit users from seeing certain rows of data.
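The date-based partition key described above can be sketched with PostgreSQL's declarative partitioning; the table and column names here are hypothetical:

```sql
-- Hypothetical orders table, horizontally partitioned by order_date.
-- Each partition holds one year of rows and can be backed up,
-- secured, or taken out of service independently of the others.
CREATE TABLE orders (
    id         bigint        NOT NULL,
    order_date date          NOT NULL,
    amount     numeric(10,2)
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_2024 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE orders_2025 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
```

Queries that qualify on `order_date` touch only the relevant partition, which is why a date column is such a common partition key.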
Taking one of the partitioned files out of service because it is damaged, or so that it can be recovered, still allows processing against the other partitioned files to continue. Each partitioned file can be placed on a separate disk drive to reduce contention for the same drive, improving query and maintenance performance across the database.

Horizontal partitioning is similar to creating a supertype/subtype relationship, because different types of the entity are involved in different relationships.

Advantages:

1. Efficiency: Data queried together are stored close to one another and separate from data not used together.
2. Local optimization: Each partition of data can be stored to optimize performance for its own use.
3. Security: Data not relevant to one group of users can be segregated from data those users are allowed to use.
4. Recovery and uptime: Smaller files take less time to back up and recover, and other files are still accessible if one file is damaged, so the effects of damage are isolated.
5. Load balancing: Files can be allocated to different storage areas, which minimizes contention for access to the same storage area or even allows parallel access to the different areas.

Disadvantages:

1. Inconsistent access speed: Different partitions may have different access speeds, which can confuse users. If data must be accessed across partitions, response times are slower.
2. Complexity: Partitioning is usually not transparent to application code, which must know which partition to read from or write to.
3. Extra space and update time: Data may be duplicated across partitions, taking extra storage space compared with fully normalized files, and updates that affect multiple partitions are slower.

Range partitioning

* Each partition is defined by a range of values for one or more columns of the normalized table.
* A table row is inserted into the proper partition based on its initial values for the range fields.
* Each partition may hold a different number of rows.
* A partition key that produces a more balanced distribution of rows is better.
* A row may be restricted from moving between partitions when its key values are updated.

Hash partitioning

* Data are spread evenly across partitions independent of any partition key value.
* Overcomes the uneven distribution of rows that is possible with range partitioning.
* Works well when distributing data evenly across devices is necessary.

List partitioning

* Partitions are defined based on a predefined list of values of the partitioning key.
* E.g., `province` as the partition key, with one partition per province (or group of provinces).

Vertical partitioning

> Vertical partitioning: distribution of the columns of a logical relation into several separate physical tables.

Data replication

* The same data are stored in multiple places in the database.

## Section 2: File Organization and Database Architectures

Outcomes:

* discuss criteria for choosing a file organization
* compare the three basic file organization methods
* organize files according to the three basic file organization methods
* cluster records to reduce data access time
* determine when to use indexes
* describe parallel processing and explain how it is used to improve performance

Read:

* Designing physical database files
* Using and selecting indexes
* Designing a database for optimal query performance

### Designing physical database files

> Physical file: a named portion of secondary memory (e.g., a hard disk drive) allocated for the purpose of storing physical records.

> Tablespace: a named logical storage unit in which data from one or more database tables, views, or other database objects may be stored.

> Extent: a contiguous section of disk storage space.

> File organization: a technique for physically arranging the records of a file on secondary storage devices.

Factors to consider when choosing a file organization:

1. fast data retrieval
2. high throughput for processing data input and maintenance transactions
3. efficient use of storage space
4. protection from failures or data loss
5. minimizing the need for reorganization
6. accommodating growth
7. security from unauthorized use

Sequential file organization

The records in a file are stored in sequence according to a primary key value. Rarely used for database storage because of its inflexibility, but it may be used for backups.

Indexed file organization

The records are stored sequentially or non-sequentially, and an index is created that allows the application software to locate individual records.

> Index: a table used to determine the location of the records in a file that satisfy some condition.

Each index entry matches a key value with one or more records. An index that allows each entry to point to more than one record is called a secondary key index.

> Secondary key: one field or a combination of fields for which more than one record may have the same combination of values. Also called a non-unique key.

> Join index: an index on columns from two or more tables that come from the same domain of values.

Hashed file organization

A storage organization in which the address of each record is determined using a hashing algorithm.

> Hashing algorithm: a routine that converts a primary key value into a record address.

A typical hashing algorithm divides each primary key value by a suitable prime number and uses the remainder of the division as the relative storage location.

> Hash index table: a file organization that uses hashing to map a key into a location in an index, where there is a pointer to the actual data record matching the hash key.

> Pointer: a field of data indicating a target address that can be used to locate a related field or record of data.

* An index table is smaller than a data table.
* A hash index adds the expense of storing and maintaining the index space.
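As a sketch of the two ideas above (all names are hypothetical): the mod-prime calculation maps a key to a relative storage location, and PostgreSQL exposes hash-based lookup through hash indexes.

```sql
-- Mod-prime address calculation: key 4216 in a file with 997
-- relative storage locations lands at 4216 % 997, i.e. location 228.
SELECT 4216 % 997 AS relative_location;

-- A hash index on a hypothetical customers table. Hash indexes
-- support only equality comparisons, but an equality lookup hashes
-- the key directly to a bucket instead of traversing a B-tree.
CREATE INDEX customers_email_hash ON customers USING HASH (email);

SELECT * FROM customers WHERE email = 'pat@example.com';
```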
File organization overview

| Factor | Sequential | Indexed | Hashed |
| --- | --- | --- | --- |
| Storage space | No wasted space | No wasted space for data, but extra space for the index | Extra space may be needed for additions and deletions after the initial load |
| Sequential retrieval on primary key | Very fast | Moderately fast | Impractical, unless a hash index is used |
| Random retrieval on primary key | Impractical | Moderately fast | Very fast |
| Multiple-key retrieval | Requires scanning the whole file | Very fast with multiple indexes | Not possible, unless a hash index is used |
| Deleting records | Can create wasted space or require reorganizing the file | Easy if space can be dynamically allocated, but requires index maintenance | Very easy |
| Adding records | Requires rewriting the file | Easy if space can be dynamically allocated, but requires index maintenance | Very easy, though hash collisions need extra work |
| Updating records | Usually requires rewriting the file | Easy, but requires index maintenance | Very easy |

### Using and selecting indexes

Creating a unique key index:

```sql
CREATE UNIQUE INDEX customers_pk ON customers(id);
```

* index name: `customers_pk`
* table: `customers`
* column: `id`

Creating a composite unique key index:

```sql
CREATE UNIQUE INDEX line_items_pk ON line_items(order_id, product_id);
```

Creating a secondary (non-unique) key index:

```sql
CREATE INDEX products_fk ON orders(product_id);
```

When to use indexes

There is a trade-off between improved read performance and degraded write performance due to index maintenance. Indexes can be used generously in read-heavy systems, but use them sparingly in write-heavy systems.

Guidelines:

1. Indexes are most useful on larger tables.
2. The primary key should have a unique index.
3. Indexes are useful for columns that appear in `WHERE` clauses.
4. Use indexes for columns referenced in `ORDER BY` and `GROUP BY` clauses.
5. Indexes are most useful on columns with a wide variety of values.
6. Indexes on large values are not optimal; try to index a smaller coded value that represents the larger value.
7. If the key for the index is used to determine where the record will be stored, use a surrogate key.
8. Check the DBMS for limits on indexes for a table (number of indexes, size of an index).
9. Be careful indexing attributes that can be null. Some systems do not index `null` values, so a sequential scan is needed to find rows where the column is `null`.

### Designing a database for optimal query performance

The primary purpose of physical database design is to optimize the performance of database processing.

Parallel query processing: with partitioned data and multiple processors, different portions of a query (or different partitions) can be processed simultaneously.

## Summary

Denormalization transforms normalized relations into non-normalized implementation specifications. Horizontal partitioning breaks a relation into multiple record specifications by placing different rows into different tables, based on common column values. Vertical partitioning distributes the columns of a relation into separate files, repeating the primary key in each file.

A physical file is a named portion of secondary memory for storing physical records. A file organization arranges the records of a file on secondary storage:

1. sequential
2. indexed (most popular)
3. hashed