# Unit 4
## Section 1: Physical Database Design
* transform logical database model into a physical database model.
* physical model is the last stage of database design.
* detailed specification of the database structure.
* definition of:
* physical names
* formats
* locations of physical records
* their fields
Read: Chapter 5
* the physical database design process
* designing fields
* denormalizing and partitioning data
Important terms:
* field: the smallest unit of application data recognized by system software
* data type: a detailed coding scheme recognized by software systems for representing organizational data
* data integrity: controls (default values, range control, null value control, referential integrity) that protect the correctness of stored data
* denormalization: transforming normalized relations into non-normalized physical record specifications
* horizontal partitioning: distributing the rows of a logical relation into several separate tables
* vertical partitioning: distributing the columns of a logical relation into several separate physical tables
### The physical database design process
* goal: data processing efficiency
#### critical decisions
* storage format: data type
* provide the DBMS with hints on how to group attributes
* provide the DBMS with hints on how to arrange similar data
* select structures for storing and connecting files efficiently. (indexes, db architecture)
* prepare strategies for handling queries and optimize for performance
#### Usage analysis
* add notation to the EER diagram
* represents final set of normalized relations
* record data volume
* record access frequency
#### Designing Fields
> Field: The smallest unit of application data recognized by system software.
* type of data
* data integrity controls
* handling of missing values for the field
> data type: a detailed coding scheme recognized by software systems for representing organizational data.
* space to store data
* speed required to access data
1. represent all possible values
2. improve data integrity
3. support all data manipulations
4. minimize storage space
[Postgresql Data Types](https://www.postgresql.org/docs/current/datatype.html)
* Numeric
* monetary
* character
* binary
* date/time
* boolean
* enum
* geometric
* network address types
* bit string types
* text search types
* uuid types
* xml types
* json types
* arrays
* composite types
* range types
* domain types
* object identifier types
#### Data Integrity
* default value
* range control
* null value control
* referential integrity control
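The four integrity controls above can be sketched in SQL DDL; the table and column names here are illustrative, not from the text:

```sql
CREATE TABLE departments (
    id   INTEGER PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE employees (
    id        INTEGER PRIMARY KEY,
    dept_id   INTEGER REFERENCES departments(id),  -- referential integrity control
    name      VARCHAR(100) NOT NULL,               -- null value control
    hire_date DATE DEFAULT CURRENT_DATE,           -- default value
    salary    NUMERIC(10,2)
              CHECK (salary BETWEEN 0 AND 500000)  -- range control
);
```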
#### Handling Missing Data
* default value
* do not allow null
* substitute estimate of missing data
* track missing data so it can be resolved manually
* perform sensitivity testing
### Denormalizing
normalized relations solve maintenance anomalies and minimize redundancies,
but they may not yield efficient data processing when the required data
spans many relations: joins are costly when related data are not stored
near each other.
> Denormalization: process of transforming normalized relations into non-normalized physical record specifications.
Opportunities for denormalization
1. two entities with a one-to-one relationship
2. many-to-many relationship (associative entity) with nonkey attributes
* instead of joining three files to extract data from two basic entities
3. reference data
Dangers of denormalization
* increase chance of errors and inconsistencies
* optimization for one type of processing may affect a different type of processing
* more storage space requirements
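A sketch of opportunity 1, collapsing a hypothetical one-to-one pair (students and their applications) into a single physical record so queries avoid the join; the schema is illustrative:

```sql
-- Normalized: two tables in a one-to-one relationship
-- students(student_id, name)
-- applications(application_id, student_id, application_date, qualifications)

-- Denormalized: one physical record holds both entities' attributes
CREATE TABLE students (
    student_id       INTEGER PRIMARY KEY,
    name             VARCHAR(100),
    application_date DATE,            -- nullable: not every student has applied
    qualifications   VARCHAR(200)
);
```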
### Partitioning
#### Horizontal Partitioning
> Horizontal partitioning: Distribution of the rows of a logical relation into several separate tables.
Partition key
* single column
* date: usually a qualifier in queries
Makes table maintenance easier, because fragmentation and rebuilding can be isolated to single
partitions as storage space needs to be reorganized.
File level security can be applied to prohibit users from seeing certain rows of data.
Taking one of the partitioned files out of service because it was damaged or so it can be recovered
still allows processing against the other partitioned files to continue.
Each partitioned file can be placed on a separate disk drive to reduce contention for the same drive
and hence improve query and maintenance performance across the database.
Horizontal partitioning is similar to creating a supertype/subtype relationship, because different types
of the entity are involved in different relationships.
Advantages
1. Efficiency: Data queried together are stored close to one another and separate from data not used together.
2. Local optimization: Each partition of data can be stored to optimize performance for its own use.
3. Security: Data not relevant to one group of users can be segregated from data those users are allowed to use.
4. Recovery and uptime: Smaller files take less time to back up and recover and other files are still accessible if one file is damaged, so the effects of damage are isolated.
5. Load balancing: Files can be allocated to different storage areas, which minimizes contention for access to the same storage area or even allows for parallel access to the different areas.
Disadvantages
1. Inconsistent access speed: Different partitions may have different access speeds, thus confusing users. If data needs to be accessed across partitions, this produces slower response times.
2. Complexity: Partitioning is usually not transparent to application code, which must know which partition to read from or write to.
3. Extra space and update time: Data may be duplicated across partitions, taking extra storage space compared to all data in normalized files. Updates that affect multiple partitions will be slower.
Range partitioning
* Each partition is defined by a range of values for one or more columns of the normalized table.
* A table row is inserted in the proper partition based on its initial values for the range fields.
* Each partition may hold a different number of rows.
* A partition key that can produce a more balanced distribution of rows is better.
* A row may be restricted from moving between partitions when key values are updated.
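In PostgreSQL, range partitioning as described above can be declared directly; the table and partition bounds here are assumed for illustration:

```sql
CREATE TABLE orders (
    order_id   INTEGER,
    order_date DATE NOT NULL
) PARTITION BY RANGE (order_date);

-- Rows are routed to the proper partition on insert, based on order_date.
CREATE TABLE orders_2023 PARTITION OF orders
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE orders_2024 PARTITION OF orders
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
```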
Hash Partitioning
* data are evenly spread across partitions independent of any partition key value.
* overcomes the uneven distribution of rows that is possible with range partitioning.
* works well when distributing data evenly across devices is necessary.
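A minimal PostgreSQL sketch of hash partitioning (table and column names assumed); the DBMS spreads rows across partitions by hashing the key:

```sql
CREATE TABLE measurements (
    sensor_id INTEGER,
    reading   NUMERIC
) PARTITION BY HASH (sensor_id);

CREATE TABLE measurements_p0 PARTITION OF measurements
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE measurements_p1 PARTITION OF measurements
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
-- ...create matching partitions for remainders 2 and 3
```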
List Partitioning
* partitions are defined based on predefined list of values of the partitioning key.
* E.g. `province` partition key
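Using the `province` partition key mentioned above, list partitioning in PostgreSQL might look like this (groupings are illustrative):

```sql
CREATE TABLE customers (
    customer_id INTEGER,
    province    CHAR(2) NOT NULL
) PARTITION BY LIST (province);

CREATE TABLE customers_west PARTITION OF customers
    FOR VALUES IN ('BC', 'AB');
CREATE TABLE customers_east PARTITION OF customers
    FOR VALUES IN ('ON', 'QC');
```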
#### Vertical Partitioning
> Vertical Partitioning: Distribution of the columns of a logical relation into several separate physical tables.
Data replication
* The same data is stored in multiple places in the database.
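Vertical partitioning can be sketched as two physical tables that each repeat the primary key of the one logical relation; the product schema here is hypothetical:

```sql
-- Frequently accessed columns
CREATE TABLE product_core (
    product_id INTEGER PRIMARY KEY,
    name       VARCHAR(100),
    price      NUMERIC(10,2)
);

-- Rarely accessed columns, with the primary key repeated
CREATE TABLE product_detail (
    product_id  INTEGER PRIMARY KEY REFERENCES product_core(product_id),
    description TEXT
);

-- The full logical relation is reconstructed with a join:
-- SELECT * FROM product_core JOIN product_detail USING (product_id);
```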
## Section 2: File Organization and Database Architectures
Outcomes:
* discuss criteria for selecting a file organization
* compare three basic file organization methods
* organize files according to the three basic file organization methods
* cluster records to reduce the data access time
* determine when to use indexes
* describe parallel processing and explain how it is used to improve performance
Read:
* Designing physical database files
* Using and selecting indexes
* Designing a database for optimal query performance
### Designing physical database files
> physical file: a named portion of secondary memory (hdd) allocated for the purpose of storing physical records.
> tablespace: a named logical storage unit in which data from one or more database tables, views or other database objects may be stored.
> extent: a contiguous section of disk storage space.
> file organization: is a technique for physically arranging the records of a file on secondary storage devices.
1. fast data retrieval
2. high throughput for processing data input and maintenance transactions
3. efficient use of storage space
4. protection from failures or data loss
5. minimizing need for reorganization
6. accommodating growth
7. security from unauthorized use
#### Sequential file organization
The records in a file are stored in sequence according to a primary key value.
Not used in databases due to its inflexibility, but may be used for backups.
#### Indexed file organization
The records are stored sequentially or non-sequentially and an index is created that allows the
application software to locate individual records.
> Index: a table used to determine the location in a file of the records that satisfy some condition.
Each index entry matches a key value with one or more records.
An index that allows each entry to point to more than one record is called a secondary key index.
> Secondary key: one field or a combination of fields for which more than one record may have the same combination of values. Also called a non-unique key.
> join index: An index on columns from two or more tables that come from the same domain of values.
#### Hashed file organization
A storage system in which the address of each record is determined using a hashing algorithm.
> Hashing algorithm: is a routine that converts a primary key value into a record address.
A typical hashing algorithm divides each primary key value by a suitable prime number and uses the remainder of the division as the relative storage location.
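The division-remainder technique can be sketched in SQL; the key values and the prime bucket count (97) are hypothetical:

```sql
-- Relative storage location = primary key mod prime
SELECT pk, pk % 97 AS relative_address
FROM (VALUES (1001), (1002), (20482)) AS k(pk);
```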
> hash index table: a file organization that uses hashing to map a key into a location in an index, where there is a pointer to the actual data record matching the hash key.
> Pointer: a field of data indicating a target address that can be used to locate related field or record of data.
* an index table is smaller than a data table.
* Added expense of storing and maintaining the index space.
#### File organization overview
* Storage space
* sequential: no wasted space
* indexed: no wasted space for data but extra space for index
* hashed: extra space may be needed to allow for addition and deletion of records after the initial set of records is loaded.
* sequential retrieval on primary key:
* sequential: very fast
* indexed: moderately fast
* hashed: impractical, unless using hash index
* random retrieval on primary key
* sequential: impractical
* indexed: moderately fast
* hashed: very fast
* multiple key retrieval
* sequential: requires scanning whole file
* indexed: very fast with multiple indexes
* hashed: not possible unless using hash index
* deleting records:
* sequential: can create wasted space or require reorganizing
* indexed: if space can be dynamically allocated, easy. Requires maintenance of indexes
* hashed: very easy
* adding new records:
* sequential: requires rewriting a file
* indexed: if space can be dynamically allocated, easy. Requires maintenance of indexes.
* hashed: very easy. Hash collisions need extra work.
* updating records:
* sequential: usually requires rewriting a file
* indexed: easy but requires maintenance of indexes
* hashed: very easy
### Using and selecting indexes
Creating a unique key index
```sql
CREATE UNIQUE INDEX customers_pk ON customers(id);
```
* name: `customers_pk`
* table: `customers`
* column: `id`
Creating a composite unique key.
```sql
CREATE UNIQUE INDEX line_items_pk ON line_items(order_id, product_id);
```
Creating a secondary (non-unique) key index
```sql
CREATE INDEX products_fk ON orders(product_id);
```
When to use indexes
There is a trade-off between improved read performance and degraded write performance due to
index maintenance.
Indexes can be used generously in read-heavy systems, but should be added sparingly in
write-heavy systems.
Guidelines:
1. indexes are most useful on larger tables
2. primary key should have a unique index.
3. indexes are useful for columns that appear in `WHERE` clauses.
4. Use indexes for columns that are referenced in `ORDER BY` and `GROUP BY` clauses.
5. Indexes are useful where there is a wide variety of values.
6. Indexes on large values are not optimal. Try to find a smaller coded value to index that represents the larger value.
7. If the key for the index is used to determine the location where the record will be stored then a surrogate key should be used.
8. Check DBMS for limits on the # of indexes on a table. (limits include # of indexes, and size of index)
9. Be careful of indexing attributes that have null values. Some systems don't index `null` so sequential scan is needed for `null` value.
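Relating to guideline 9: PostgreSQL does index NULLs, but a partial index can still shrink an index on a sparsely populated column; the table and column names here are illustrative:

```sql
-- Index only the rows where the column has a value
CREATE INDEX orders_ship_date_idx ON orders(ship_date)
    WHERE ship_date IS NOT NULL;
```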
### Designing a database for optimal query performance
The primary purpose of physical database design is to optimize the performance of database processing.
Parallel query processing splits a query into subtasks that run concurrently against different partitions, processors, or disks, then combines the results to improve performance.
## Summary
Denormalization transforms normalized relations into non-normalized implementation specifications.
Horizontal partitioning breaks a relation into multiple record specifications by placing different
rows into different tables, based on common column values.
Vertical partitioning distributes the columns of a relation into separate files, repeating the primary key in each of the files.
A physical file is a named portion of secondary memory for storing physical records.
A file organization arranges the records of a file on a secondary storage.
1. sequential
2. indexed (most popular)
3. hashed