| author | mo khan <mo.khan@gmail.com> | 2020-02-11 19:32:07 -0700 |
|---|---|---|
| committer | mo khan <mo.khan@gmail.com> | 2020-02-11 19:32:07 -0700 |
| commit | c5ffd3098047d4732b29ca6f7ea16270a78954c7 | |
| tree | 5d95099e58c1b14efefaea8a8dc77a980ad0a529 | |
| parent | 907e751d9360f545f890c38b54b3e4a0547b7d34 | |
Add notes for unit-7
| -rw-r--r-- | doc/unit-7.md | 303 |

1 file changed, 303 insertions, 0 deletions
diff --git a/doc/unit-7.md b/doc/unit-7.md
index 523d758..cf471a0 100644
--- a/doc/unit-7.md
+++ b/doc/unit-7.md
@@ -40,3 +40,306 @@ Read the Chapter 9 sections:

3. load data.

I think I just fell asleep for several pages.
This is not my jam.

> Data mart: A data warehouse that is limited in scope, whose data are obtained by selecting and summarizing data from a data warehouse or from separate extract, transform and load processes from source data systems.

> Independent data mart: A data mart filled with data extracted from the operational environment, without the benefit of a data warehouse.

Independent data mart architecture

1. A separate ETL process is developed for each data mart, which can yield costly redundant data and processing efforts.
2. Data marts may not be consistent with one another because they are often developed with different technologies, so they may not provide a clear enterprise-wide view of data concerning important subjects such as customers, suppliers and products.
3. There is no capability to drill down into greater detail or into related facts in other data marts or a shared data repository, so analysis is limited, or at best very difficult.
4. Scaling costs are excessive because every new application that creates a separate data mart repeats all the extract and load work.
5. If there is an attempt to make the separate data marts consistent, the cost to do so is quite high.

The value of independent data marts is the subject of two debates:

1. One debate deals with the nature of the phased approach to implementing a data warehousing environment.
2. The other deals with the suitable database architecture for analytical processing.

> Dependent data mart: A data mart filled exclusively from an enterprise data warehouse and its reconciled data.

> Enterprise data warehouse (EDW): A centralized, integrated warehouse that is the control point and single source of all data made available to end users for decision support applications.

The dependent data mart and operational data store architecture is often called the "hub and spoke" approach, in which the EDW is the hub, and the source data systems and the data marts sit at the ends of the input and output spokes.

> Operational data store (ODS): An integrated, subject-oriented, continuously updateable, current-valued, enterprise-wide, detailed database designed to serve operational users as they do decision support processing.

Logical data mart and real-time data warehouse architecture

* Practical only for moderate-sized data warehouses or when using high-performance data warehousing technology.

> Logical data mart: A data mart created by a relational view of a data warehouse.

1. Logical data marts are not physically separate databases but rather different relational views of one physical, slightly denormalized relational data warehouse.
2. Data are moved into the data warehouse rather than into a separate staging area, to utilize the high-performance computing power of the warehouse technology for the cleansing and transformation steps.
3. New data marts can be created quickly because no physical database or database technology needs to be created or acquired, and no loading routines need to be written (see the sketch after this list).
4. Data marts are always up to date because data in a view are created when the view is referenced; views can be materialized if a user has a series of queries and analyses that need to work off the same instantiation of the data mart.
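
A rough sketch of points 3 and 4: the "data mart" is just a relational view over the warehouse, so no separate database or load routine is needed. The table and column names here (`sales_history`, `region`, and so on) are hypothetical, not from the text.

```sql
-- Minimal sketch of a logical data mart as a view, assuming a hypothetical
-- warehouse table sales_history(product_name, region, sales_date, sales_amount).
CREATE VIEW west_region_sales_mart AS
SELECT product_name,
       sales_date,
       SUM(sales_amount) AS total_sales
  FROM sales_history
 WHERE region = 'WEST'
 GROUP BY product_name, sales_date;

-- If a series of analyses must work off the same instantiation, the view
-- can be materialized instead (syntax varies by DBMS):
-- CREATE MATERIALIZED VIEW west_region_sales_mart AS ...;
```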
> Real-time data warehouse: An enterprise data warehouse that accepts near-real-time feeds of transactional data from the systems of record, analyzes warehouse data, and in near real time relays business rules to the data warehouse and systems of record so that immediate action can be taken in response to business events.

Goals:

* Capture customer data at the time of a business event.
* Analyze customer behaviour and predict customer responses to possible actions.
* Develop rules for optimizing customer interactions, including the appropriate response and channel that will yield the best results.
* Take immediate action with customers at touch points, based on the best responses to customers as determined by decision rules, in order to make desirable results happen.

Three-Layer Data Architecture

1. Operational data are stored in the various operational systems of record throughout the organization.
2. Reconciled data are the type of data stored in the enterprise data warehouse and an operational data store.
3. Derived data are the type of data stored in each of the data marts.

> Reconciled data: Detailed, current data intended to be the single, authoritative source for all decision support applications.

> Derived data: Data that have been selected, formatted, and aggregated for end-user decision support applications.

Role of metadata

1. Operational metadata describe the data in the various operational systems that feed the enterprise data warehouse.
2. Enterprise data warehouse (EDW) metadata are derived from the enterprise data model; they describe the reconciled data layer as well as the rules for extracting, transforming, and loading operational data into reconciled data.
3. Data mart metadata describe the derived data layer and the rules for transforming reconciled data into derived data.

Characteristics of data warehouse data

> Transient data: Data in which changes to existing records are written over previous records, thus destroying the previous data content.

> Periodic data: Data that are never physically altered or deleted once they have been added to the store.

Other data warehouse changes

1. New descriptive attributes.
2. New business activity attributes.
3. New classes of descriptive attributes.
4. Descriptive attributes become more refined.
5. Descriptive data are related to one another.
6. New sources of data.

The Star Schema

A star schema is a simple database design (suited to ad hoc queries) in which dimensional data are separated from fact or event data.

A star schema is one version of a dimensional model. Although the star schema is suited to ad hoc queries, it is not suited to online transaction processing, and therefore is not generally used in operational systems, operational data stores, or an EDW.

It is called a star schema because of its visual appearance.

> Star schema: A simple database design in which dimensional data are separated from fact or event data. A dimensional model is another name for a star schema.

Fact tables and dimension tables

A star schema consists of two types of tables:

1. one fact table
2. one or more dimension tables

Fact tables contain factual or quantitative data. Dimension tables hold descriptive data about the subjects of the business.

Dimension tables are usually the source of attributes used to qualify, categorize, or summarize facts in queries, reports or graphs.

Dimension data are usually textual and discrete.
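
A minimal DDL sketch of this structure; the table and column names (`product_dim`, `period_dim`, `sales_fact`) are hypothetical, not from the text.

```sql
-- Dimension tables: simple surrogate primary keys plus descriptive,
-- mostly textual attributes.
CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY, -- surrogate key
    product_name VARCHAR(50),
    product_size VARCHAR(20)
);

CREATE TABLE period_dim (
    period_key INTEGER PRIMARY KEY,   -- surrogate key
    year_no    INTEGER,
    quarter_no INTEGER
);

-- Fact table: quantitative data, with a composite primary key formed by
-- concatenating the dimension foreign keys.
CREATE TABLE sales_fact (
    product_key  INTEGER NOT NULL REFERENCES product_dim (product_key),
    period_key   INTEGER NOT NULL REFERENCES period_dim (period_key),
    units_sold   INTEGER,
    dollars_sold DECIMAL(10, 2),
    PRIMARY KEY (product_key, period_key)
);
```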
A data mart might contain several star schemas with similar dimension tables, but each with a different fact table.

Typical dimensions:

* product
* customer
* period or time

**Time is always one of the dimensions.**

* Each dimension table has a one-to-many relationship to the central fact table.
* Each dimension table generally has a simple primary key, as well as several nonkey attributes.
* Each dimension table's primary key is a foreign key in the fact table.
* The primary key of the fact table is a composite key that consists of the concatenation of all the foreign keys, plus possibly other components that do not correspond to dimensions.
* The relationship between each dimension table and the fact table provides a join path that allows users to query the database easily, using SQL statements for either predefined or ad hoc queries.

The star schema is a denormalized implementation of the relational data model. The fact table plays the role of a normalized n-ary associative entity that links the instances of the various dimensions, which are in second, but possibly not third, normal form. The dimension tables are denormalized.

* Relationships between dimensions are not allowed, although such relationships may exist in the organization.

Example:

    +-----------------+                     +-----------------+
    | Dimension Table |                     | Dimension Table |
    +-----------------+                     +-----------------+
    | Key 1 (PK)      |                     | Key 2 (PK)      |
    | attribute       |                     | attribute       |
    | attribute       |                     | attribute       |
    | attribute       |                     | attribute       |
    +-----------------+                     +-----------------+
             |                                      |
             |         +------------------+         |
             +-------> | Fact Table       | <-------+
                       +------------------+
                       | Key 1 (PK)(FK)   |
                       | Key 2 (PK)(FK)   |
                       | Data column      |
                       | Data column      |
                       +------------------+

A star schema provides answers to a domain of business questions, e.g.:

1. Which cities have the highest sales of large products?
2. What is the average monthly sales for each store manager?
3. In which stores are we losing money on which products? Does this vary by quarter?

Surrogate Keys

Every key used to join the fact table with a dimension table should be a surrogate (system-assigned) key.

Why?

* Business keys change, often slowly, over time, and we need to remember old and new business key values for the same business object.
* Surrogate keys also allow us to keep track of different nonkey attribute values for the same production key; e.g., the same product production key can map to several surrogate keys, one for each package size.
* Surrogate keys are often simpler and shorter.
* Surrogate keys can be of the same length and format for all keys.

Grain of the fact table

The raw data of the star schema are kept in the fact table. All data in a fact table are determined by the same combination of composite key elements. Determining the lowest level of detailed fact data stored is arguably the most important and difficult data mart design step.

The level of detail of this data is specified by the intersection of all of the components of the primary key of the fact table. This intersection of primary keys is called the grain of the fact table.

> Grain: The level of detail in a fact table, determined by the intersection of all the components of the primary key, including all foreign keys and any other primary key elements.

A common grain is the individual business transaction, such as a line item or an individual scanned item on a product sales receipt, a personnel change order, a line item on a material receipt, a claim against an insurance policy, a boarding pass, or an individual ATM transaction.

Clicks on a website are possibly the lowest level of granularity.
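
To make the join path concrete: a sketch of a typical star-join query against the hypothetical `product_dim`/`period_dim`/`sales_fact` tables sketched earlier, qualifying and summarizing facts by dimension attributes.

```sql
-- Which products sold best in each quarter? Join the fact table to its
-- dimensions and aggregate the quantitative columns.
SELECT p.product_name,
       d.year_no,
       d.quarter_no,
       SUM(f.dollars_sold) AS total_sales
  FROM sales_fact f
  JOIN product_dim p ON p.product_key = f.product_key
  JOIN period_dim  d ON d.period_key  = f.period_key
 GROUP BY p.product_name, d.year_no, d.quarter_no
 ORDER BY total_sales DESC;
```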
Multiple fact tables

> Conformed dimensions: One or more dimension tables associated with two or more fact tables, for which the dimension tables have the same business meaning and primary key with each fact table.

10 rules for dimensional modeling

1. Use atomic facts: Eventually, users want detailed data, even if their initial requests are for summarized facts.
2. Create single-process fact tables: Each fact table should address the important measurements for one business process, such as taking a customer order or placing a material purchase order.
3. Include a date dimension for every fact table: A fact should be described by the characteristics of the associated date/time to which that fact is related.
4. Enforce consistent grain: Each measurement in a fact table must be atomic for the same combination of keys.
5. Disallow null keys in fact tables: Facts apply to the combination of key values, and helper tables may be needed to represent some M:N relationships.
6. Honor hierarchies: Understand the hierarchies of dimensions, and carefully choose whether to snowflake the hierarchy or denormalize it into one dimension.
7. Decode dimension tables: Store descriptions of the surrogate keys and codes used in fact tables in associated dimension tables, which can then be used for report labels and query filters.
8. Use surrogate keys: All dimension table rows should be identified by a surrogate key, with descriptive columns showing the associated production and source system keys.
9. Conform dimensions: Conformed dimensions should be used across multiple fact tables.
10. Balance requirements with actual data: Unfortunately, source data may not precisely support all business requirements, so you must balance what is technically possible with what users want and need.

Column databases: A new alternative for data warehouses

* Column database technologies trade increased compute time for storage space savings.

The User Interface

* Traditional query and reporting tools
* OLAP, MOLAP and ROLAP tools
* Data visualization tools
* Business performance management and dashboard tools
* Data-mining tools

Traditional query and reporting tools include spreadsheets, personal computer databases, and report writers and generators.

Role of metadata

The first requirement for building a user-friendly interface is a set of metadata that describes the data.

The metadata associated with data marts are often referred to as a "data catalog", "data directory", or some similar term.

Metadata should allow users to answer questions like:

* What subjects are described in the data mart?
* What dimensions and facts are included in the data mart?
* How are the data in the data mart derived from the enterprise data warehouse data?
* How are the data in the EDW derived from operational data?
* What reports and predefined queries are available to view the data?
* What drill-down and other data analysis techniques are available?
* Who is responsible for the quality of data in the data marts, and to whom are requests for changes made?

SQL OLAP Querying

At the heart of analytical queries is the ability to perform:

* categorization
* aggregation
* ranking

Useful SQL constructs include:

* window functions with OVER
* PARTITION BY within OVER
* QUALIFY
* RANK

E.g.
```sql
SELECT id, quarter, sales,
       -- moving average over the current quarter and the two preceding ones
       AVG(sales) OVER (
           PARTITION BY id
           ORDER BY quarter
           ROWS 2 PRECEDING
       ) AS three_qtr_average
  FROM sales_histories;
```

> Online analytical processing (OLAP): The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques.

> ROLAP: Relational OLAP.
> MOLAP: Multidimensional OLAP.

> Data visualization: The representation of data in graphical and multimedia formats for human analysis.

> Data mining: Knowledge discovery, using a sophisticated blend of techniques from traditional statistics, artificial intelligence, and computer graphics.

Goals of data mining:

* Explanatory: Explain some observed event or condition, such as why sales of pickup trucks have increased in Colorado.
* Confirmatory: Confirm a hypothesis, such as whether two-income families are more likely to buy family medical coverage than single-income families.
* Exploratory: Analyze data for new or unexpected relationships, such as what spending patterns are likely to accompany credit card fraud.
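
Returning to the SQL OLAP constructs listed before the windowing example: a minimal sketch of RANK, reusing the same hypothetical `sales_histories` table. QUALIFY is nonstandard (Teradata and a few other dialects), so it appears only in a comment.

```sql
-- Rank each id's quarters from highest to lowest sales.
SELECT id, quarter, sales,
       RANK() OVER (PARTITION BY id ORDER BY sales DESC) AS sales_rank
  FROM sales_histories;

-- In dialects with QUALIFY, ranked rows can be filtered directly,
-- e.g. keep each id's top two quarters:
-- SELECT id, quarter, sales
--   FROM sales_histories
--  QUALIFY RANK() OVER (PARTITION BY id ORDER BY sales DESC) <= 2;
```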
