Topic 10 : Summary Tables
Practice

Objectives

 

This practice introduces the concepts of aggregate data and the use of summary tables to hold this data. Summary tables or summaries store presummarized (or preaggregated) data. Calculations are performed on low-level data and placed into tables that store the pre-calculated results. In this way, summary tables provide pre-calculated answers to known queries.

 

Identifying and implementing summaries in the warehouse is critical to providing timely response to analytical queries. Effective use of summaries is the single most important technique for improving performance in data warehouses. Because most decision support queries call for summarization (or aggregation) of data elements, using summaries in your design, eliminates the need to repeat resource intensive calculations and avoids slow and costly GROUP BY operations by individual queries.

 

Design Alternatives

 

Summaries can either be stored in:

 

Multiple fact tables use the constellation (or galaxy), or the snowflake configuration.

 

One large fact table can be accommodated within a snowflake or star schema. The access and update requirements aid you in determining the right model for your warehouse.

 

Constellation Configuration

 

The constellation configuration consists of a central atomic fact table where the base grain data is stored. Conceptually, surrounding this central star are other stars with summarized dimensions, also called derivative dimensions, and summary fact tables. Dimensions may or may not be shared in this configuration.

 

Constellation Configuration Advantages

 

 

Constellation Configuration Disadvantage

 

 

Snowflake Configuration

 

The snowflake configuration enables the dimension tables to be shared. Normalizing the dimensions is required so that the different levels within the dimensions can be used in the summarized fact tables. In the example above, you see that each of the dimensions, Product (prod), Time, Customer (cust), and Store are further normalized, for example Customer dimension to Customer Classification (class).

 

One Large Fact Table

 

Using one big fact table can enable you to include the summary information with the detail data in the same table. The keys are not as obvious as keys that would be used in the case of multiple tables. The key management alternatives are the following:

 

You do not need a composite key in the fact table, but you might want to include level information in the dimension table. Opting for the level indicator may result ion better performance.

 

One Large Fact Table Advantages

 

All summarized data is in one location, the query process does not need knowledge of where the summaries are stored–aggregation navigation is not required.

 

One Large Fact Table Disadvantages

 

Choosing Summaries

 

Few implementations are able to manage all possible combinations. More important, not all combinations address the information and reporting needs of the end user.

There are two schools of thought in the selection of dimension and levels for summarization:

 

Technical Constraints

 

Summarizing on every possible combination achieves the best performance, assuming you can retrieve to the summarized data. However, the cost in performance and maintenance is substantial.

 

Creating all possible combinations requires tremendous resources for:

 

Restating Summaries

 

Summaries are restated when the data is refreshed. The batch load window may limit the resources available to restate the summary tables to keep them consistent with the raw data. There are strategies to manage the summary table update outside of the load window– using mirrored data, for example. Applying rigor to the analysis and selection of summary tables at the outset optimizes the resource use while providing acceptable performance levels to the users.

 

 

 

 

Instructions

 

In this practice, you:

 


 

1. Determine the number of levels in each dimension table.

 

(Hint: The Id columns in the dimension table represent level.)

 

 

2. Determine the number of summary tables you can possibly have in your model.

 

(Hint: Multiply the number of level in each dimension, and subtract 1 for the base table.)

 

 

3. Recommend a summary table strategy and the resulting architecture

 

 

 

 

Topic 10 Solutions

 

1. Determine the number of levels in each dimension table.

 

 

 

2. Determine the number of summary tables you can possibly have in your model.

 

If you are not including the lookup table for package, then the formula is as follows:

If you are including the lookup table for package, then the formula is as follows:

 

 

3. Recommend a summary table strategy and the resulting architecture

 

Creating summaries for all combinations is possible considering the size of this database. Developing a snowflake schema for the summaries tables is recommended, because it will give you the optimal performance and additional drill down capabilities.

Practice: Database Sizing

Objectives

In this practice, you examine some of the different approaches that can size the data warehouse database. In addition you consider some of the other storage considerations that you need to make to calculate how much storage space is needed overall.

 

Sizing the Database and Other Storage Requirements

 

A major factor in capacity and space planning is the physical size of the data warehouse database. The considerations for sizing require that you determine the amount of physical disk required for the data warehouse database, the tables, views, and indexes. For example, determine the amount of physical disk space required for:

 

Sizing the database is not an exact science. The technique employed varies between implementations. There are many different approaches that can be suggested, you need to consider an approach that is going to suit your implementation.

 

Sizing the Database

 

Database sizing is not an exact science. There are different techniques, and each one should be judged by its merits and suitability.

 

When sizing the database consider:

 

When sizing for overall storage consider the implications of:

 

Grain

 

Ultimately the grain of the data in the data warehouse determines the database size. If your warehouse is to contain daily transaction level data (high grain) then you need more space than say weekly summarized data (low grain).

 

Typically the fact table stores the high grain data, the summary tables hold the low grain data.

The grain determines the level of detail you are able to provide to your users. The granularity when analyzing business requirements

 

Sparsity and Design

 

Sparsity is a feature you must consider in your design, because sparsity can lead to poor query performance. Sparsity can affect your sizing estimates: high sparsity results typically in overestimate of space. High density, low sparsity, results in more accurate space estimates using standard algorithms.

 

Dimensions Increase Sparsity

 

The bottom line is that the more dimensions you add to your model, the more sparse the data becomes, because fewer and fewer combinations are valid. There is no limit to the number of dimensions you might employ. However managing the resulting sparsity is a key to providing optimal performance.

 

A factor for sparsity should be applied to any database sizing.

 

Sizing Techniques

 

Technique 1: Average Line Item

 

The following calculation, shows how to estimate the amount of direct access storage device (DASD) needed for three years worth of data, using an average sales price, and assuming line item level grain.

 


Technique 2: Raw Estimates

 

This calculation is performed for each table:

 


Technique 3: Test Load Sampling

 

A good approach to sizing is based on the analysis of a statistically significant sample of the data.

 

Test loads can be performed on data from a day, a week, a month, or any other period of time. You must ensure the sample periods reflect the true day to day operations of your company, and the results include any seasonality issues, or other factors such as worst case scenarios which may otherwise prejudice the results. Once you have determined the number of transactions based on the sample, then you calculate the size using Technique 1, the Average Sales Price approach.

 

You must consider the impact of:

Execute this query to determine the average number of rows per database block:

 

For Oracle7

 

select AVG(COUNT(SUBSTR(ROWID,10,4)))

from <TABLE>

where ROWNUM <= <TOTAL-#-ROWS-MINUS-A-FEW>

group by SUBSTR(ROWID,1,8), SUBSTR(ROWID,15,4);

For Oracle8

 

 

 

select AVG(COUNT(DBMS_ROWID.ROWID_ROW_NUMBER(ROWID)))

from <TABLE>

where ROWNUM <= <TOTAL-#-ROWS-MINUS-A- FEW>

group by DBMS_ROWID.ROWID_OBJECT(ROWID),

DBMS_ROWID.ROWID_RELATIVE_FNO(ROWID),

DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID);

 

Technique 4: Rows, Frequency, and Growth

 

Another approach might be for each table, fact, dimension, and summary enter the table name, number of rows (NR), and total row length (TRL). NR x TRL = total size in bytes. Then multiply by the frequency of load. This gives you the total space required per table.

Total the space required for all tables, and apply any growth factors.

 

 

Instructions

 

In this practice, you:

 

1. Using the average line item approach

 

a. Calculate the approximate size of dimensional tables

(Hint: <Number of Rows> x <Row length> in bytes.)

b. Calculate the approximate size of the fact table.

(Hint: <Rows in dimension(1)> x <Rows in dimension(2)> x <Rows in dimension(n)> x <Row length in fact table>.)

 

 

2. List three shortcomings of this approach.

 

 

3. Recommend the most accurate approach to sizing your warehouse.

 

(Hint: List the methods in the order that you might employ each - if at all. Explain addition+al adjustments that you would make.)

 

 

 

 

 

Topic 12 Solutions

 

 

1. Using the average line item approach

 

a. Calculate the approximate size of dimensional tables

b. Calculate the approximate size of the fact table.

 

 

2. List three shortcomings of this approach.

 

 

3. Recommend the most accurate approach to sizing your warehouse.

 

1