## Question 1

Consider the following EER diagram for the Royal Victoria Hospital (RVH) database.

![EER Diagram](./question-1-eer.png)

Data volume and access for this diagram are as follows:

* There are 1 000 patients and 500 items yielding a total of 10 000 usage records in the database.
* There are 50 physicians and a total of 4 000 prescriptions in the database.
* There are 200 treatments in the database.
* There are 50 accesses per day for patient records; of these, 30 request access to both prescription and usage records.
* There are 20 accesses per day for physician records; of these, 20 request access to prescriptions.
* There are 50 accesses per day to item records; of these, 10 request access to usage records.
* There are 5 direct accesses per day for treatment records.
* Of the total accesses to prescription records, 20 request access to patients, 30 request access to physicians, and 35 request access to treatment.
* Of the total accesses to usage records, 10 request access to patients and 30 request access to items.

Draw a composite usage map for the RVH database.

![Composite Usage Map](./composite-usage-map-800.jpg)

## Question 2

Answer the following questions (250 words max/question).

**What are the typical integrity controls performed in both data integrity and referential integrity?**

* default value: ensure each field start with a default value if one isn't provided
* range control: ensures that a field falls within a specific range of values
* null value control: ensures that `null` values are not allowed to be inserted in a fields position.
* referential integrity control: ensures that foreign keys to other tables exist ensuring orphaned records cannot be created.

**Using an example for each situation, illustrate the three common situations that suggest relations should be denormalized.**

1. Two entities with one-to-one relationship
  ```text
    User(id: integer, username: string, password_digest: string, profile_id: integer)
    Profile(id: integer, first_name: string, last_name: string, email: string)
  ```

  If a web site displays the currently logged in users `username` and `email` address
  on all pages, then it might make sense to de-normalize this data to reduce the need for
  an INNER JOIN on the two tables to render every screen.

  ```text
    User(id: integer, username: string, password_digest: string, first_name: string, last_name: string, email: string)
  ```

2. A many-to-many relationship with nonkey attributes
  ```text
    Order(id: integer, customer_id: integer)
    LineItem(id: integer, product_id: integer, order_id: integer, quantity: integer, purchase_price: decimal)
    Product(id: integer, name: string, description: text, current_price: decimal)
  ```

  In order to display a customers order history it might make sense to denormalize the
  relationships to make it easy to see the product name and description to capture these fields 
  as they were at the time of purchase.

  This would ensure that the order history can display the name and description of the product
  at the time of purchase rather than the current product name and description. The current
  product name and description can change and may include or exclude information that was
  available at the time of purchase.

  ```text
    Order(id: integer, customer_id: integer)
    LineItem(id: integer, product_id: integer, order_id: integer, quantity: integer, purchase_price: decimal, name: string, description: text)
  ```

3. Reference data
  ```text
    User(id: integer, username: string, password_digest: string, avatar_id: integer)
    Avatar(id: integer, url: string)
  ```

  In this scenario each user in the system can have an associated avatar to display as a profile picture.
  Instead of loading a separate record via an association to get the url to the profile image to display
  it may make more sense to add the avatar url to the users relation.

  ```text
    User(id: integer, username: string, password_digest: string, avatar_url: string)
  ```

**What are the advantages and disadvantages of horizontal and vertical partitioning?**

* Horizontal Partitioning
  * Advantages:
    1. Efficiency: Data queried together are stored close to one another and separate from data not used together.
    2. Local optimization: Each partition of data can be stored to optimize performance for it's own use.
    3. Security: Data not relevant to one group of users can be segregated from data those users are allowed to use.
    4. Recovery and uptime: Smaller files take less time to back up and recover and other files are still accessible if one file is damaged, so the effects of damage are isolated.
    5. Load balancing: Files can be allocated to different storage areas, which minimizes contention for access to the same storage area or even allows for parallel access to the different areas.
  * Disadvantages:
    1. Inconsistent access speed: Different partitions may have different access speeds, thus confusing users. If data needs to be accessed across partitions, this produces slower response times.
    2. Complexity: is usually not transparent to application code, which requires knowledge of which partition to read/write from.
    3. Extra space and update time: Data may be duplicated across partitions, taking extra storage space compared to all data in normalized files. Updates that affect multiple partitions will be slower.
* Vertical Partitioning
  * Advantages:
    1. Efficiency: Data queried together are stored close to one another and separate from data not used together.
  * Disadvantages:
    1. More joins are needed in order to gather the required data reducing query speed or increasing the need for more indexes which increases the amount of needed disk space.

## Question 3

Answer the following questions (250 words max/question).

**What factors should be considered when choosing a file organization?**

The following is a list of general guidelines to consider when choosing a file organization.

1. fast data retrieval
2. high throughput for processing data input and maintenance transactions
3. efficient use of storage space
4. protection from failures or data loss
5. minimizing need for reorganization
6. accommodating growth
7. security from unauthorized use

The needs of the application and the distribution of the software may also raise the priority of
some of the points raise above.

In a system where the data might be shipped as an embedded database into a resource restricted
environment such as an on-premises appliance factors such as disk space may have a higher priority
than protecting from failures or data loss.

In a system where the write traffic significantly outweighs the read traffic you may want
to consider giving up efficient use of storage space in favour of high throughput for processing
data input and accommodating growth. In such a system it might be okay to allow data
to be eventually consistent by partitioning the data across nodes in a ring of servers.

In a system where the speed of retrieving the data is the most important factor such as a
cache then you might be okay with storing the data in memory rather than persisting it to disk.
This is a common requirement for a cache.

**What is the purpose of clustering data in a file?**

Clustering data allows for sequential access of related data to improve the read
performance of the desired data. It reduces the need to find related information be
accessing multiple files and therefore speeds up the response time for the data.

**Compare hashed file organization versus indexed file organization. List two advantages of indexed over hashed, and two advantages of hashed over indexed.**

Hashed file advantages over indexed file

1. Allows for O(1) lookup to find data, which can mean much faster access time to access a specific piece of data.
2. Deleting data is very easy

Indexed file advantages over hashed file

1. Allows for efficient access to multiple related pieces of data.
2. No wasted space for data. Additional space is required to maintain the index.
3. Lower CPU overhead than computing a hash key for each record and hash collisions are not a concern.

## Question 4

Consider the following database:

```text
Employee(emp-no, name, department, salary)
ProjAssigned(emp-no, proj-no, worked-hours)
```

```sql
CREATE TABLE Employee(
  "emp-no" bigint primary key,
  name varchar(255),
  department varchar(255),
  salary decimal
);

CREATE TABLE ProjAssigned(
  "proj-no" bigint primary key,
  "emp-no" bigint references Employee("emp-no"),
  "worked-hours" integer
);
```

**Write one SELECT SQL query to list the numbers and names of all employees with a salary greater than 66 000 who are assigned to projects, the projects they are assigned to, and the corresponding hours worked. Your list should be sorted by employee name.**

```sql
SELECT e."emp-no", e.name, pa."proj-no", pa."worked-hours"
FROM Employee e
INNER JOIN ProjAssigned pa ON pa."emp-no" = e."emp-no"
WHERE e.salary > 66000.0
ORDER BY e.name ASC;
```

**Define indexes on selected attributes to speed up your query, and justify your selections.**

* An index on the `emp-no` column on the `ProjAssigned` table will improve the join performance.
* An index on the `salary` column will improve the filter performance.
* An index on the `name` column will improve the order by performance.
* An addition index on the primary key columns is not needed because primary keys already have a unique index.

**Write SQL queries to create the indexes you defined above.**

```sql
CREATE INDEX employees_idx ON ProjAssigned("emp-no");
CREATE INDEX employee_salary_idx ON Employee(salary);
CREATE INDEX employee_names_idx ON Employee(name ASC);
```

## Question 5

Consider the following three relations:

```text
TRAVEL_AGENT (name, age, salary)
CUSTOMER (name, departure_city, destination, journey_class)
TRANSACTION (number, cust_name, travel_agent_name, amount_paid)
```

```sql
CREATE TABLE TravelAgent(
  name varchar(255),
  age integer,
  salary decimal
);
CREATE TABLE Customer(
	name varchar(255),
	departure_city varchar(255),
	destination varchar(255),
	journey_class varchar(255)
);
CREATE TABLE Transaction(
	"number" integer,
	cust_name varchar(255),
	travel_agent_name varchar(255),
	amount_paid decimal
);
```

Write SQL statements to answer the following questions.

**Compute the number of different customers who have a transaction.**

```sql
SELECT COUNT(DISTINCT c.name)
FROM Customer c
INNER JOIN Transaction t ON t.cust_name = c.name;
```

**Display the name of the oldest travel agent.**

```sql
SELECT t.name
FROM TravelAgent t
WHERE t.age IN (
  SELECT MAX(age) from TravelAgent
);
```

**List the total number of transactions for each travel agent. Consider only those transactions where the amount paid exceeds 1 000.**

```sql
SELECT travel_agent_name, COUNT(travel_agent_name)
FROM Transaction
WHERE amount_paid > 1000.0
GROUP BY travel_agent_name
ORDER BY travel_agent_name;
```

**Display the names and ages of the travel agents who have arranged journeys for customer “John Smith”, in descending order of age (use a subquery).**

```sql
SELECT name, age
FROM TravelAgent
WHERE name IN (
  SELECT DISTINCT(travel_agent_name)
  FROM Transaction
  WHERE cust_name = 'John Smith'
)
ORDER BY age DESC;
```

**Display the names and ages of travel agents who have arranged journeys for customer “John Smith”, in descending order of age (do not use a subquery).**

```sql
SELECT DISTINCT(ta.name), ta.age
FROM TravelAgent ta
INNER JOIN Transaction t ON ta.name = t.travel_agent_name
WHERE t.cust_name = 'John Smith'
ORDER BY ta.age DESC;
```

**Display the age of travel agents who have arranged journeys for customer “John Smith” to “Ottawa” (use a subquery).**

```sql
SELECT age
FROM TravelAgent
WHERE name IN (
  SELECT DISTINCT(travel_agent_name)
  FROM Transaction
  WHERE cust_name IN (
    SELECT DISTINCT(name)
    FROM Customer
    WHERE name = 'John Smith'
    AND destination = 'Ottawa'
  )
);
```

**Display the age of travel agents who have arranged journeys for customer “John Smith” to “Ottawa” (do not use a subquery).**

```sql
SELECT DISTINCT(ta.age)
FROM TravelAgent ta
INNER JOIN Transaction t ON t.travel_agent_name = ta.name
INNER JOIN Customer c ON c.name = t.cust_name
WHERE c.destination = 'Ottawa'
AND c.name = 'John Smith';
```

**Display the names and salaries of all travel agents who did not arrange journeys for customer “John Smith”, in ascending order of salary.**

```sql
SELECT Distinct(ta.name), ta.salary
FROM TravelAgent ta
WHERE ta.name NOT IN (
  SELECT travel_agent_name
  FROM Transaction
  WHERE cust_name = 'John Smith'
)
ORDER BY ta.salary ASC;
```

**Display the names of travel agents who have five or more transactions.**

```sql
SELECT DISTINCT(t.travel_agent_name)
FROM Transaction t
GROUP BY t.travel_agent_name
HAVING COUNT(t.travel_agent_name) >= 5;
```

**Display the names of all travel agents who have arranged at least ten journeys to "Ottawa".**

```sql
SELECT DISTINCT(t.travel_agent_name)
FROM Transaction t
INNER JOIN Customer c on t.cust_name = c.name
WHERE c.destination = 'Ottawa'
GROUP BY t.travel_agent_name
HAVING COUNT(t.travel_agent_name) >= 10;
```