# Unit 7: Introduction to Data Warehousing

## Section 1 - Basic Concepts of Data Warehousing

* Read "Chapter 9: Basic concepts of data warehousing"

* data warehouse: a subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes.

* subject-oriented: organized around key subjects, e.g. customers, patients, students, products, and time.
* integrated: data housed in the data warehouse are defined using consistent naming conventions, formats, encoding structures and related characteristics gathered from multiple sources.
* time-variant: data in the data warehouse contain a time dimension so that they may be used to study trends and changes.
* non-updatable: loaded and refreshed from operational systems but cannot be updated by end users.

## Section 2 - Data Warehouse Architectures and OLAP Tools

Read the Chapter 9 sections:

* "Data warehouse architectures," 
* "Some characteristics of data warehouse data," 
* "The derived data layer," 
* "Column databases: A new alternative for data warehouses,"
* "The user interface"
* "SQL OLAP querying," and 
* "Data visualization"

Objectives:

* describe the main architectures used to build data warehouse systems;
* list and explain the basic characteristics and types of data in data warehousing systems;
* explain the main concepts involved in the design of data marts;
* describe the process used to generate derived data;
* describe the star schema, as used in data marts;
* present a variety of tools and techniques that are used to query, analyze, and visualize the data stored in data warehouses and data marts.

### Data Warehouse architectures

* independent data mart architecture: a bottom-up, incremental approach
* dependent data mart (EDW-fed) architecture: a top-down approach, with emphasis on coordination and an enterprise-wide perspective

ETL process:

1. Extract data from internal/external sources (daily, weekly, monthly).
2. Transform data.
3. Load data.
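The three ETL steps above can be sketched in a few lines. This is a minimal illustration, not a production pipeline; all table and column names (`orders_src`, `sales_fact`) are hypothetical, and sqlite3 stands in for real source and warehouse systems.

```python
import sqlite3

# 1. Extract: pull rows from an operational source system.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders_src (order_id INTEGER, amount TEXT, region TEXT)")
source.executemany("INSERT INTO orders_src VALUES (?, ?, ?)",
                   [(1, "10.50", "west"), (2, "n/a", "EAST"), (3, "7.25", "East")])
rows = source.execute("SELECT order_id, amount, region FROM orders_src").fetchall()

# 2. Transform: enforce consistent formats and encodings; drop bad records.
clean = []
for order_id, amount, region in rows:
    try:
        clean.append((order_id, float(amount), region.strip().lower()))
    except ValueError:
        pass  # in practice, reject rows would be routed to an error table

# 3. Load: insert the reconciled rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales_fact (order_id INTEGER, amount REAL, region TEXT)")
warehouse.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", clean)

print(warehouse.execute("SELECT COUNT(*), SUM(amount) FROM sales_fact").fetchone())
```

Note how the transform step is where the "integrated" property of the warehouse is enforced: inconsistent region encodings from the source become one consistent format before loading.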

I think I just fell asleep for several pages.
This is not my jam.


> Data mart: A data warehouse that is limited in scope,
> whose data are obtained by selecting and summarizing data from a data warehouse or from separate extract,
> transform and load processes from source data systems.

> Independent data mart: A data mart filled with data extracted from the operational environment, without
> the benefit of a data warehouse.


Independent data mart architecture

1. A separate ETL process is developed for each data mart, which can yield costly redundant data and processing efforts.
2. Data marts may not be consistent with one another because they are often developed with different technologies, and thus they may not provide a clear enterprise-wide view of data concerning important subjects such as customers, suppliers and products.
3. There is no capability to drill down into greater detail or into related facts in other data marts or shared data repository, so analysis is limited, or at best very difficult.
4. Scaling costs are excessive because every new application that creates a separate data mart repeats all the extract and load scripts.
5. If there is an attempt to make the separate data marts consistent, the cost to do so is quite high.

Debates over the value of independent data marts:

1. One debate deals with the nature of the phased approach to implementing a data warehousing environment.
2. The other debate deals with the suitable database architecture for analytical processing.

> Dependent data mart: A data mart filled exclusively from an enterprise data warehouse and its reconciled data.

> Enterprise data warehouse (EDW): A centralized, integrated warehouse that is the control point and single source of all data made available to end users for decision support applications.

The dependent data mart and operational data store architecture is often called the "hub and spoke" approach, in which the EDW
is the hub and the source data systems and the data marts sit at the ends of the input and output spokes.


> Operational data store (ODS): An integrated, subject-oriented, continuously updatable, current-valued, enterprise-wide, detailed database designed to serve operational users as they do decision support processing.


Logical data mart and real-time data warehouse architecture

* practical only for moderate-sized data warehouses or when high-performance data warehousing technology is used.

> Logical data mart: a data mart created by a relational view of a data warehouse.


1. Logical data marts are not physically separate databases but rather different relational views of one physical, slightly denormalized relational data warehouse.
2. Data are moved into the data warehouse rather than into a separate staging area, to utilize the high-performance computing power of the warehouse technology to perform the cleansing and transformation steps.
3. New data marts can be created quickly because no physical database or database technology needs to be created or acquired, and no loading routines need to be written.
4. Data marts are always up to date because data in a view are created when the view is referenced; views can be materialized if a user has a series of queries and analyses that need to work off the same instantiation of the data mart.
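The defining trick of a logical data mart is that the "mart" is nothing more than a relational view over the warehouse. A minimal sketch, with hypothetical table and column names, using sqlite3:

```python
import sqlite3

# One physical, slightly denormalized warehouse table.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE sales_fact (
    store_id INTEGER, product_id INTEGER, month TEXT, units INTEGER)""")
db.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", [
    (1, 100, "2024-01", 5), (1, 101, "2024-01", 3), (2, 100, "2024-01", 7)])

# The "logical data mart": no physical copy, no loading routine. Because the
# view is evaluated when referenced, the mart is always up to date.
db.execute("""CREATE VIEW store1_mart AS
    SELECT product_id, month, units FROM sales_fact WHERE store_id = 1""")

print(db.execute("SELECT SUM(units) FROM store1_mart").fetchone()[0])  # 8
```

Creating a new mart for another store is just another `CREATE VIEW`, which is why point 3 above holds.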

> Real-time data warehouse: An enterprise data warehouse that accepts near-real-time feeds of transactional data from the systems of record, analyzes warehouse data, and in near-real-time relays business rules to the data warehouse and systems of record so that immediate action can be taken in response to business events.


Goal:

* Capture customer data at the time of a business event.
* Analyze customer behaviour and predict customer responses to possible actions
* Develop rules for optimizing customer interactions, including the appropriate response and channel that will yield the best results.
* Take immediate action with customers at touch points based on best responses to customers as determined by decision rules in order to make desirable results happen.


Three-Layer Data Architecture

1. Operational data are stored in the various operational systems of record throughout the organization.
2. Reconciled data are the type of data stored in the enterprise data warehouse and an operational data store.
3. Derived data are the type of data stored in each of the data marts.

> Reconciled data: Detailed, current data intended to be the single, authoritative source for all decision support applications.

> Derived data: Data that have been selected, formatted, and aggregated for end-user decision support applications.


Role of metadata

1. Operational metadata: describe the data in the various operational systems that feed the enterprise data warehouse.
2. Enterprise data warehouse (EDW) metadata are derived from the enterprise data model. EDW metadata describe the reconciled data layer as well as the rules for extracting, transforming, and loading operational data into reconciled data.
3. Data mart metadata: describe the derived data layer and the rules for transforming reconciled data to derived data.



Characteristics of data warehouse data

> Transient data: data in which changes to existing records are written over previous records, thus destroying the previous data content.

> Periodic data: data that are never physically altered or deleted once they have been added to the store.

Other data warehouse changes

1. new descriptive attributes
2. new business activity attributes
3. new classes of descriptive attributes
4. descriptive attributes become more refined
5. descriptive data are related to one another
6. new sources of data


The Star Schema

A star schema is a simple database design (suited for ad-hoc queries) in which dimensional data are
separated from fact or event data.

A star schema is one version of a dimensional model.
Although the star schema is suited to ad hoc queries, it is not suited to online transaction
processing, and therefore is not generally used in operational systems, operational data
sources, or an EDW.

It is called a star schema because of its visual appearance: the fact table sits at the centre, with the dimension tables radiating out from it like the points of a star.


> Star schema: A simple database design in which dimensional data are separated from fact or event data.
> A dimensional model is another name for a star schema.

Fact tables and dimension tables

A star schema consists of two types of tables:

1. one fact table.
2. one or more dimension tables.

Fact tables contain factual or quantitative data.
Dimension tables hold descriptive data about the subjects of the business.

Dimension tables are usually the source of attributes used to qualify, categorize, or summarize facts
in queries, reports or graphs;

Dimension data are usually textual and discrete.

A data mart might contain several star schemas with similar dimension tables but each with a different fact table.

Typical dimensions:

* product
* customer
* period or time.

**time is always one of the dimensions**


* Each dimension table has a one-to-many relationship to the central fact table.
* Each dimension table generally has a simple primary key, as well as several nonkey attributes.
* The primary key is a foreign key in the fact table.
* The primary key of the fact table is a composite key that consists of the concatenation of all the foreign keys plus possible other components that do not correspond to dimensions.
* The relationship between each dimension table and the fact table provides a join path that allows users to query the database easily, using SQL statements for either predefined or ad hoc queries.

The star schema is a denormalized implementation of the relational data model.

The fact table plays the role of a normalized n-ary associative entity that links the instances of the various dimensions, which are in second, but possibly not third, normal form.

The dimension tables are denormalized.

* Relationships between dimensions are not allowed, although such a relationship may exist in the organization.


Example:

```text
| Dimension Table |
| Key 1 (PK)      |----+
| attribute       |    |
| attribute       |    |
| attribute       |    |
                       |      | Fact Table     |
                       +----> | Key 1 (PK)(FK) |
                       +----> | Key 2 (PK)(FK) |
                       |      | Data column    |
                       |      | Data column    |
| Dimension Table |    |
| Key 2 (PK)      |----+
| attribute       |
| attribute       |
| attribute       |
```
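The diagram above translates directly into DDL. A minimal sketch in sqlite3 (table and column names are hypothetical): each dimension table has a simple surrogate primary key, and the fact table's primary key is the composite of the dimension foreign keys.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY,   -- surrogate key
    product_code TEXT,                 -- business (production) key
    description TEXT,
    size TEXT
);
CREATE TABLE period_dim (
    period_key INTEGER PRIMARY KEY,    -- time is always one of the dimensions
    year INTEGER,
    quarter INTEGER
);
CREATE TABLE sales_fact (
    product_key INTEGER REFERENCES product_dim,
    period_key INTEGER REFERENCES period_dim,
    units_sold INTEGER,
    dollars_sold REAL,
    PRIMARY KEY (product_key, period_key)  -- composite of all foreign keys
);
""")

db.execute("INSERT INTO product_dim VALUES (1, 'P-100', 'widget', 'large')")
db.execute("INSERT INTO period_dim VALUES (1, 2024, 1)")
db.execute("INSERT INTO sales_fact VALUES (1, 1, 10, 99.5)")

# The dimension-to-fact relationships provide the join paths for queries.
row = db.execute("""
    SELECT d.description, p.year, f.units_sold
    FROM sales_fact f
    JOIN product_dim d ON f.product_key = d.product_key
    JOIN period_dim p ON f.period_key = p.period_key""").fetchone()
print(row)  # ('widget', 2024, 10)
```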

A star schema provides answers to a domain of business questions.

E.g.

1. Which cities have the highest sales of large products?
2. What is the average monthly sales for each store manager?
3. In which stores are we losing money on which products? Does this vary by quarter?

Surrogate Key

Every key used to join the fact table with a dimension table should be a surrogate (system assigned) key.

Why?

* Business keys change, often slowly, over time, and we need to remember old and new business key values for the same business object.
* A surrogate key also lets us keep track of different non-key attribute values for the same production key; for example, the same product code sold in several package sizes can be given a separate surrogate key for each size.
* Surrogate keys are often simpler and shorter.
* Surrogate keys can be of the same length and format for all keys.

Grain of the fact table

The raw data of the star schema are kept in the fact table.
All data in a fact table are determined by the same combination of composite key elements.
Determining the lowest level of detailed fact data stored is arguably the most important and difficult data mart design step.

The level of detail of this data is specified by the intersection of all of the components of the primary key of the fact table.

This intersection of primary keys is called the grain of the fact table.

> Grain: the level of detail in a fact table, determined by the intersection of all the components of the primary key, including all foreign keys and any other primary key elements.

A common grain would be each business transaction, such as an individual line item or an individual scanned item on a product sales receipt, a personal change order, a line item on a material receipt, a claim against an insurance policy, a boarding pass, or an individual ATM transaction.


A click on a website is possibly the lowest level of granularity.

Size of the fact table

The grain and duration of the fact table have a direct impact on the size of that table.

We can estimate the number of rows in the fact table as follows:

1. Estimate the number of possible values for each dimension associated with the fact table.
2. Multiply the values obtained in the first step, after making any necessary adjustments.

E.g., 1,000 stores, 10,000 products (of which only 50% record sales in a given month), and 24 months:

Total rows = 1,000 stores × (10,000 products × 50% = 5,000) × 24 months = 120,000,000 rows
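The estimate is just the product of the adjusted dimension cardinalities:

```python
# Row-count estimate for the fact table from the figures above.
stores = 1_000
products_with_sales = 10_000 * 0.50   # adjustment: only half the products sell in a month
months = 24

total_rows = int(stores * products_with_sales * months)
print(total_rows)  # 120000000
```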

Multiple fact tables

> Conformed dimensions: One or more dimension tables associated with two or more fact tables for which the dimension tables have the same business meaning and primary key with each fact table.


10 rules for dimension modeling

1. Use atomic facts: eventually, users want detailed data, even if their initial requests are for summarized facts.
2. Create single-process fact tables: each fact table should address the important measurements for one business process, such as taking a customer order or placing a material purchase order.
3. Include a date dimension for every fact table: a fact should be described by the characteristics of the associated date/time to which that fact is related.
4. Enforce consistent grain: Each measurement in a fact table must be atomic for the same combination of keys.
5. Disallow null keys in fact tables: Facts apply to the combination of key values, and helper tables may be needed to represent some M:N relationships.
6. Honor hierarchies: understand the hierarchies of dimensions and carefully choose to snowflake the hierarchy or denormalize into one dimension.
7. Decode dimension tables: Store descriptions of surrogate keys and codes used in fact tables in associated dimension tables, which can then be used to report labels and query filters.
8. Use surrogate keys: All dimension table rows should be identified by a surrogate key, with descriptive columns showing the associated production and source system keys.
9. Conform dimensions: Conformed dimensions should be used across multiple fact tables.
10. Balance requirements with actual data: Unfortunately, source data may not precisely support all business requirements, so you must balance what is technically possible with what users want and need.


Column databases: A new alternative for data warehouses

* column database technologies trade compute time for storage-space savings: data are compressed column by column, which shrinks storage and speeds scans over a few columns at the cost of extra CPU work.


The User Interface

* traditional query and reporting tools
* OLAP, MOLAP and ROLAP tools
* Data visualization tools
* Business performance management and dashboard tools
* Data-mining tools

Traditional query and reporting tools include spreadsheets, personal computer databases, and report writers and generators.

Role of metadata

The first requirement for building a user-friendly interface is a set of metadata that describes the data.

The metadata associated with the data marts are often referred to as a "data catalog", "data directory"
or some similar term.

Metadata should allow users to answer questions like:

* What subjects are described in the data mart?
* What dimensions and facts are included in the data mart?
* How are the data in the data mart derived from the enterprise data warehouse data?
* How are the data in the EDW derived from operational data?
* What reports and predefined queries are available to view the data?
* What drill down and other data analysis techniques are available?
* Who is responsible for the quality of data in the data marts, and to whom are requests for changes made?

SQL OLAP Querying

The heart of the analytical queries is the ability to perform:

* categorization
* aggregation
* ranking


SQL OLAP extensions commonly used:

* the OVER clause (window specification)
* PARTITION BY (within OVER)
* QUALIFY
* RANK

E.g.

```sql
SELECT id, quarter, sales,
       AVG(sales) OVER (
         PARTITION BY id
         ORDER BY quarter
         ROWS 2 PRECEDING
       ) AS three_qtr_average
FROM sales_histories;
```

(Note: the alias must not start with a digit, so `3QtrAverage` would be invalid in most SQL dialects.)
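A runnable version of that moving-average query, with illustrative data (sqlite3 supports window functions as of SQLite 3.25):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales_histories (id INTEGER, quarter INTEGER, sales REAL)")
db.executemany("INSERT INTO sales_histories VALUES (?, ?, ?)",
               [(1, 1, 100.0), (1, 2, 200.0), (1, 3, 300.0), (1, 4, 400.0)])

# ROWS 2 PRECEDING averages the current quarter plus the two before it,
# restarting within each id partition.
rows = db.execute("""
    SELECT id, quarter, sales,
           AVG(sales) OVER (
               PARTITION BY id
               ORDER BY quarter
               ROWS 2 PRECEDING
           ) AS three_qtr_average
    FROM sales_histories
    ORDER BY id, quarter
""").fetchall()
for r in rows:
    print(r)
```

For quarter 3 the window holds (100, 200, 300), so the three-quarter average is 200; for quarters 1 and 2 the window is shorter and the average covers only the rows available so far.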

> Online analytical processing (OLAP): the use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques.

> ROLAP: Relational OLAP
> MOLAP: Multidimensional OLAP


> Data Visualization: The representation of data in graphical and multimedia formats for human analysis.

> Data mining: knowledge discovery, using a sophisticated blend of techniques from traditional statistics, artificial intelligence, and computer graphics.

Goals of data mining:

* Explanatory: explain some observed event or condition, such as why sales of pickup trucks have increased in Colorado.
* Confirmatory: confirm a hypothesis, such as whether two-income families are more likely to buy family medical coverage than single-income families.
* Exploratory: analyze data for new or unexpected relationships, such as what spending patterns are likely to accompany credit card fraud.