doc/unit-4.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207

# Unit 4

## Section 1: Physical Database Design

* transform logical database model into a physical database model.
* physical model is the last stage of database design.
* detailed specification of the database structure.
  * definition of:
    * physical names
    * formats
    * locations of physical records
    * their fields

Read: Chapter 5
* the physical database design process
* designing fields
* denormalizing and partitioning data

Important terms:

* field:
* data type:
* data integrity:
* denormalization:
* horizontal partitioning:
* vertical partitioning:


### The physical database design process

* goal: data procesing effeciency

#### critical decisions

* storage format: data type
* provide dbms with hints on how to group attributes
* provide dmbs with hints on how to arrange similar data
* select structures for storing and connecting files efficiently. (indexes, db architecture)
* prepare strategies for handling queries. optimize for perf.

#### Usage analysis

* add notation to the EER diagram
  * represents final set of normalized relations

* record data volume
* record access frequency

#### Designing Fields

> Field: The smallest unit of application data reconized by system software.

* type of data
* data integrity controls
* handling of missing values for the field

> data type: a detailed coding scheme recognized by software system for representing organizational data.

* space to store data
* speed required to access data

1. represent all possible values
2. improve data integrity
3. support all data manipulations
4. minimize storage space

[Postgresql Data Types](https://www.postgresql.org/docs/current/datatype.html)

* Numeric
* monetary
* character
* binary
* date/time
* boolean
* enum
* geometric
* network address types
* bit string types
* text search types
* uuid types
* xml types
* json types
* arrays
* composite types
* range types
* domain types
* object identifier types

#### Data Integrity

* default value
* range control
* null value control
* referential integrity control

#### Handling Missing Data

* default value
* do not allow null
* substitute estimate of missing data
* track missing data so it can be resolved manually
* perform sensitivity testing

### Denormalizing

noramlized relations solve maintenance anomolies and
minimize redundancies. This may not yield efficient data processing
if the data required is across many relations. i.e JOINS are costly
if the data isn't near each other.

> Denormalization: process of transforming normalized relations into non-normalized physical record specifications.

Opportunities for denormalization

1. two entities with a one-to-one relationship
2. many-to-many relationship (associative entity) with nonkey attributes
  * instead of joining three files to extract data from two basic entities
3. reference data

Dangers of denorm

* increase chance of errors and inconsistencies
* optimization for one type of processing make affect a different type of processing
* more storage space requirements

### Partitioning

Horizontal Partitioning

> Horizontal partitioning: Distribution of the rows of a logical relation into several separate tables.

Partition key

* single column
* date: usually a qualifier in queries

Makes table maintainenance easier because fragementing and rebuliding can be isolated to single partitions
as storage space needs to be organized.

File level security can be applied to prohibit users from seeing certain rows of data.
Taking one of the partitioned files out of service because it was damaged or so it can be recovered
still allows processing against the other partitioned files to continue.

Each partitioned file can be placed on a separate disk drive to reduce contention for the same drive
and hence improve query and maintenance performance across the database.


Horizontal partitioning is similar to creating supertype/subtype relationship because different types
of the entity are involved in different relationships.

Advantages

1. Efficiency: Data queried together are stored close to one another and separate from data not used together.
2. Local optimization: Each partition of data can be stored to optimize performance for it's own use.
3. Security: Data not relevant to one group of users can be segregated from data those users are allowed to use.
4. Recovery and uptime: Smaller files take less time to back up and recover and other files are still accessible if one file is damaged, so the effects of damage are isolated.
5. Load balancing: Files can be allocated to different storage areas, which minimizes contention for access to the same storage area or even allows for parallel access to the different areas.

Disadvantages

1. Inconsistent access speed: Different partitions may have different access speeds, thus confusing users. If data needs to be accessed across partitions, this produces slower response times.
2. Complexity: is usually not transparent to application code, which requires knowledge of which partition to read/write from.
3. Extra space and update time: Data may be duplicated across partitions, taking extra storage space compared to all data in normalized files. Updates that affect multiple partitions will be slower.

Range partitioning

* Each partition is defined by a range of values for one or more columns of the normalized table.
* A table row is inserted in the proper partition based on it's initial values for the range fields.
* Each partition may hold a different number of rows.
* A partition key that can produce a more balanced distribution of rows is better.
* A row may be restricted fro moving between partitions when key values are updated.

Hash Partitioning

* data are evenly spread across partitions independent of any partition key value.
* overcomes the uneven distribution of rows that is possible with range partitioning.
* works well when distributing data evenly across devices is necessary.

List Partitioning

* partitions are defined based on predefined list of values of the partitioning key.
* E.g. `province` partition key

Vertical Partitioning

> Vertical Partitioning: Distribution of the columns of a logical relation into several separate physical tables.

Data replication

* The same data is stored in multiple places in the database.

## Section 2: File Organization and Database Architectures

Outcomes:

* discuss criteria for organizing file organization
* compare three basic file organization methods
* organize files according to the three basic file organization methods
* cluster records to reduce the data access time
* determine when to use indexes
* describe parallel processing and explain how it is used to improve performance

Read:

* Designing physical database files
* Using and selecting indexes
* Designing a database for optimal query performance