Transbase
Hypercube and Data Warehousing Guide

Version V8.3

ALL RIGHTS RESERVED.

While every precaution has been taken in the preparation of this document, the publisher assumes no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

2022-11-30



Introduction

In this document we discuss the basics of Transbase Hypercube in the data warehousing environment. In particular, we show the basic concepts of data warehouse star schemata, their modeling in Transbase Hypercube® and the execution of star queries on such applications. We further discuss some performance issues and provide some hints on how to optimize star query processing.

How to read this manual

This manual includes a detailed theoretical description of the multidimensional index concepts of Transbase Hypercube and some practical sections on how to use the Hypercube index and design so-called MHC schemata.

The theoretical section on Data Warehouses helps in understanding the concepts behind the Hypercube index.

Readers interested in the practical aspects can start with the section Basics of Transbase Hypercube.

1. Data Warehouses

For a common understanding it is important to clarify what we mean by a data warehouse and to establish a common terminology and some basic concepts.

1.1. Data Warehouse Components

This section does not contain a complete description of data warehouse concepts. We focus on standard data warehouse components. In particular, we briefly describe the following items:

  • Star and snowflake schemata

  • Fact Table

  • Dimensions and Measures

  • Hierarchies

  • Basic operations (drill up/down)

  • Basic data warehouse implementation (ROLAP)

Note that some of the concepts are described in detail later in this manual.

1.1.1. Fact Table

The fact table stores the quantitative data (facts or measures). Such data includes, for example, sales, prices, etc. In a conventional relational DBMS data warehouse implementation, the fact table is a table containing dimension and measure attributes.

The fact table is usually the largest table in a warehouse schema and is the center of the so-called star schema (see Section Star Schema ).

1.1.2. Dimensions and Measures

The data stored in a DW is organized multidimensionally. We distinguish between dimension and measure data.

Dimension attributes provide categorical (qualitative) data (e.g., products, customers, time), which determines the context of the measures. In most cases hierarchies are defined on dimensions (see Section Hierarchies ). The time dimension often consists of the hierarchy all - year - month - day or all - year - quarter - week - day, where all represents all dimension elements. In most cases, dimensions are defined by a context (e.g., the business context or the organization structure). Usually, dimensions are almost static (e.g., a change to a geographic hierarchy that moves Germany to North America will seldom occur). However, in some applications frequent changes are possible. Special semantics are necessary to handle such changes and the data affected by them ([Kim96]).

Measure attributes are numeric (quantitative) data (e.g., sales, cost, turnover) that are organized by multiple dimensions. In real multidimensional systems, dimension and measure attributes often are not distinguished.

1.1.3. Hierarchies

As mentioned earlier, dimensions are usually classified by hierarchies. Note that one dimension may contain several hierarchies. The DATE dimension could have the hierarchies year - month - day and year - week - day. Depending on the evaluation context, either the first or the second hierarchy is chosen.

Hierarchies play an important role in Transbase Hypercube, since hierarchies are used to cluster the data physically on the disk and therefore speed up query processing significantly. These concepts are described later in more detail.

1.1.4. Basic Operations

The basic operations of a data warehouse application are analyses of the data in order to gain information about the data and its dependencies. The users start with a specific view of the data and try to find more detailed or more aggregated information. Such operations are called drill-down and roll-up. The drilling operations follow the hierarchy paths of the dimensions and execute queries on different aggregation levels.

For example, start with reports considering the time dimension on a "yearly" level. For a specific customer region, the year 2000 seems to be interesting and the data should be queried w.r.t. month resolution. Then the drill-down operation is from year to month. In contrast to drill-down, the roll-up operation aggregates for example from month level to year.

The slice and dice operations create individual views onto the data. With the slice operation a specific "slice" of the multidimensional cube is cut out where an aggregation of a specific measure on a special hierarchy level is computed. A further slice operation may look at the same measure by specifying different dimension aggregations. These operations are useful for ad hoc queries.

1.1.5. Star Schema

In relational DBMS, the schema of a data warehouse usually is represented by a star schema or snowflake schema. This section contains the description of the concept of star schemata.

Figure 1.1. Star Schema

Star Schema

A star schema combines the concepts of a relational DBMS with the multidimensional view of multidimensional data stored in a relational DBMS. This method is widely used for relational OLAP systems (ROLAP). Instead of storing all data (dimension attributes and measure attributes) in one single table, dimension data is normalized out of the fact table FT into dimension tables. In contrast to the snowflake schema, dimension data is stored in one single dimension table for every dimension. Figure: Star Schema illustrates a star schema with n dimensions, expressed by the dimension attributes d1, ..., dn of FT and the corresponding dimension tables D1, ..., Dn.

The center of the "star" is the fact table. The attributes di are the dimension attributes, the attributes mj are measure attributes. Every di corresponds to Di.h1, i.e., the leaf level of the hierarchies in Di. The (shared) leaf level of the hierarchy is the key of the dimension table. We require this key constraint for two reasons: First, the composite key of the complete hierarchy path (h1, h2, ..., hm) would be too long to store in the fact table for every dimension (because of the foreign key relationship of the dimension attributes in the fact table); the leaf level often is an artificial (short) key for space saving reasons. Second, if we have more than one hierarchy, the shared leaf level h1 of all hierarchies serves as a common key.

In addition to the hierarchy levels in Di, feature attributes describe the dimension attributes and hierarchy attributes in more detail. Feature attributes can be assigned to each level in the hierarchy and thus may be stored redundantly in the dimension tables. Often the hierarchy levels are artificial keys used for hierarchical dependencies and contain an additional description field (modeled as feature).

In general more than one hierarchy is allowed per dimension. Storing data in a star schema will always guarantee correct hierarchies, because every record in the dimension table contains the full path of all hierarchies on that dimension. This requires more space than a snowflake schema, but does not require a separate join for every hierarchy level. In this manual, we call the dimension tables of a star schema flat dimensions, since the complete dimension with hierarchies and features is stored within one dimension table.

A flat dimension table contains all dimension attributes, i.e., the hierarchy level attributes of all hierarchies and their feature attributes. Thus the hierarchical relationship is modeled within this table by storing the complete path within the record.

Figure 1.2. Sample Star Schema

Sample Star Schema

In Figure: Sample Star Schema we show an example of a star schema with one fact table FACT and three dimension tables Product, Segment, and Time. The dimension key attributes Item, Outlet, and Day form the primary key of the fact table. The leaf levels of the hierarchies, i.e., Item for the Product dimension, Outlet for the Segment dimension, and Day for the Time dimension, are the primary keys of the corresponding dimension tables.
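
For illustration, a minimal DDL sketch of this star schema could look as follows. The column types, the exact hierarchy levels of the Time dimension, the measure attribute Sales, and the feature attribute Weekday are assumptions for illustration only, since the figure does not specify them:

create table Product (
  Item     char(*) NOT NULL,
  "Group"  char(*) NOT NULL,   -- hierarchy level Group, quoted because GROUP is an SQL keyword
  Category char(*) NOT NULL,
  PRIMARY KEY (Item)           -- leaf level of the Product hierarchy
);

create table Segment (
  Outlet      char(*) NOT NULL,
  Region      char(*) NOT NULL,
  MicroMarket char(*) NOT NULL,
  Country     char(*) NOT NULL,
  PRIMARY KEY (Outlet)         -- leaf level of the Segment hierarchy
);

create table Time (
  Day     integer NOT NULL,
  Month   integer NOT NULL,
  Year    integer NOT NULL,
  Weekday char(*),             -- feature attribute assigned to the Day level (assumed)
  PRIMARY KEY (Day)            -- leaf level of the Time hierarchy
);

create table FACT (
  Item   char(*) NOT NULL references Product (Item),
  Outlet char(*) NOT NULL references Segment (Outlet),
  Day    integer NOT NULL references Time (Day),
  Sales  numeric(10,3),        -- measure attribute (assumed)
  PRIMARY KEY (Item, Outlet, Day)
);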

1.1.6. Snowflake Schema

A snowflake schema is the standard and more general schema used in complex data warehouses. Designing applications with a snowflake schema provides complex hierarchical relationships, common use of hierarchies by many applications, flexibility for queries, and lower space requirements. However, in general, the performance of queries will suffer from the more complex schema; especially the number of joins is higher than for a star schema. The DBMS must provide special methods in order to efficiently support snowflake schemata. In Transbase Hypercube, we designed and implemented our algorithms for both star and snowflake schemata.

Figure 1.3. Snowflake Schema

Snowflake Schema

The snowflake schema is an "extended" star schema in the sense that there are several hierarchy tables for every dimension - a kind of normalization (an example is shown in Figure: Snowflake Schema ). There are many different types of snowflake schemata depending on the normalization.

As in the star schema, the fact table FT is the center of the "snowflake" with the dimensions surrounding FT. Every dimension attribute di of FT has a foreign key relation to the primary key of the dimension table Di1, i.e., to Di1.h1. In Figure: Snowflake Schema , we assume one hierarchy per dimension (in general, an arbitrary number of hierarchies per dimension is possible), where the key of the hierarchy is the leaf level h1. Thus, the following relationship is required: FT.di = Di1.h1. We call the dimension table Di1 the leaf dimension table LDT.

The higher dimension tables, i.e., the normalized dimension tables, contain one or more hierarchy levels. Basically, each dimension table contains an arbitrary number of hierarchy levels (depending on the kind of normalization). There must be a foreign key relationship between a lower dimension table Dik and the next higher dimension table Dik+1: (Dik.hj, Dik.hj+1, ..., Dik.hj+m) -> (Dik+1.hj, Dik+1.hj+1, ..., Dik+1.hj+m). This foreign key relationship requires that the attributes (Dik+1.hj, Dik+1.hj+1, ..., Dik+1.hj+m) form the primary key of dimension table Dik+1. Each dimension table contains an arbitrary number of feature attributes, usually related to the hierarchy levels (hj, hj+1, ..., hj+m).

Definition Leaf Dimension Table, LDT: A leaf dimension table LDT of a dimension contains the key of the dimension, i.e., the dimension key attribute of the corresponding dimension in the fact table. This attribute is the leaf level for every hierarchy on the dimension. A LDT contains one or more foreign keys to hierarchy level tables and feature attributes.

Definition Hierarchy Level Table: A hierarchy level table contains a number of hierarchy levels of one hierarchy and a foreign key to the next (higher) hierarchy level. Additionally, feature attributes are stored in the hierarchy level table.

Figure 1.4. Sample Snowflake Schema

Sample Snowflake Schema

In Figure: Sample Snowflake Schema we show an example for a snowflake schema with a fact table FACT and three dimensions Product, Segment and Time. The dimensions Product and Segment are normalized into snowflake dimensions w.r.t. the feature attributes of the hierarchy levels. In the Segment hierarchy, the Region level has no feature attribute and no separate table is used for this hierarchy level. The Time dimension is modelled as a flat dimension.

For the sample snowflake schema, the following foreign key relationships are necessary:

Dimension Product:

  • FACT(Item) -> Product(Item)

  • Product(Group) -> Product_Group(Group)

  • Product_Group(Category) -> Product_Cat(Category)

Dimension Segment:

  • FACT(Outlet) -> Segment(Outlet)

  • Segment(MicroMarket) -> Segment_MM(MicroMarket)

  • Segment_MM(Country) -> Segment_Country(Country)

Dimension Time:

  • FACT(Day) -> Time(Day)
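
For illustration, the relationships of the Segment dimension could be realized by REFERENCES clauses as in the following DDL sketch (the column types and the feature attributes are assumptions, since the figure does not specify them):

create table Segment_Country (
  Country      char(*) NOT NULL,
  Country_Desc char(*),            -- feature attribute (assumed)
  PRIMARY KEY (Country)
);

create table Segment_MM (
  MicroMarket char(*) NOT NULL,
  Country     char(*) NOT NULL references Segment_Country (Country),
  MM_Desc     char(*),             -- feature attribute (assumed)
  PRIMARY KEY (MicroMarket)
);

create table Segment (
  Outlet      char(*) NOT NULL,
  Region      char(*) NOT NULL,    -- hierarchy level without feature attribute, hence no separate table
  MicroMarket char(*) NOT NULL references Segment_MM (MicroMarket),
  PRIMARY KEY (Outlet)
);

In the fact table, the Outlet column then carries a corresponding references Segment (Outlet) clause, realizing FACT(Outlet) -> Segment(Outlet).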

Normalization of dimensions is not equivalent to classic normalization of relations. We talk about normalization if a star schema is extended to a snowflake schema in the sense that hierarchy levels are stored in separate tables. On the one hand, normalization will save disk space if feature attributes of higher levels are stored in a "higher" dimension level table and not redundantly in the leaf dimension table. On the other hand, the schema can express hierarchical relationships that enable optimizing query processing, such as using the properties of multidimensional hierarchical clustering. Additionally, maintenance of dimensions will be easier and more intuitive.

Most DBMS prefer star schemata over snowflake schemata, because query processing is easier and more performant. In a star schema, the joins are reduced to the joins of the dimension tables with the fact table. In a snowflake schema, the number of joins is higher, because each dimension can consist of several tables that must be joined for query processing. We prefer snowflake schemata, because more hierarchical relationships can be expressed in snowflake schemata. Consider a hierarchy with feature attributes. In a star schema, the feature attributes are stored in the dimension table and it is not clear to which hierarchy level the feature attributes belong. A predicate on a feature attribute is then not mapped to a corresponding restriction on the hierarchy attribute and therefore cannot be optimized w.r.t. query processing with MHC (see Section: Query Processing ).

For efficient query processing, we require some prerequisites for normalized dimensions, i.e., a well-formed snowflake schema. A well-formed snowflake schema can increase the query performance, because in snowflake schemata, hierarchical relationships between feature and hierarchy attributes can be expressed. A well-formed snowflake schema is a special type of snowflake schema which is described in the following.

Definition well-formed Snowflake Schema:

A well-formed snowflake schema consists of a fact table with the dimension key attributes (d1, d2, ..., dn) as primary key and a leaf dimension table Di1 for every dimension. The leaf dimension tables contain all hierarchy levels h1, h2, ..., ht of all hierarchies of the dimension. The higher dimension tables are connected with the leaf dimension table via foreign key relationships. Between every higher dimension table Dik and Dik+1, a foreign key relationship exists.

We require the leaf dimension table with all hierarchy levels for an efficient computation of compound surrogates (see Section Maintenance of MHC ). The hierarchy is de-normalized into the leaf dimension table defining the join over the complete hierarchy.

Examples of well-formed snowflake schemata are the field normalized and the path normalized schemata described briefly in the following. Different types of normalization can be combined within one snowflake schema, since normalization is a property of the dimension. Thus, one dimension can be organized as a field normalized dimension, another as path normalized, and a third dimension can use yet another normalization. For a field normalized and a path normalized dimension we assume an LDT that contains all hierarchy level attributes of all hierarchies in the dimension. Thus the complete path of the hierarchies is stored in that LDT. In addition to the LDT, a number of hierarchy level tables define hierarchical dependencies. Without loss of generality, we assume one hierarchy for the dimension in the following.

Field Normalized Dimension:

In a field normalized dimension (FND), the leaf dimension table contains all hierarchy levels h1, ..., ht of the corresponding hierarchy. The key of the LDT is h1. For some or all hj (2 ≤ j ≤ t), a dimension level table Dij exists that contains the hierarchy level attribute hj and a number of feature attributes fk. The following foreign key relationships of the hierarchy attributes in the LDT Di1 to the corresponding attribute in the dimension level table exist: Di1(h2) -> Di2(h2), Di1(h3) -> Di3(h3), ..., Di1(ht) -> Dit(ht) (see Figure: Field Normalized Dimension ). The key of the satellite dimension table Dij is the hierarchy attribute Dij.hj.

Figure 1.5. Field Normalized Dimension

Field Normalized Dimension

A separate hierarchy level table Dij makes sense if a feature attribute is assigned to the hierarchy level j; otherwise no table Dij is necessary for the hierarchy attribute hj. To use a field normalized schema, the hierarchy attributes must be unique (in contrast to the path normalized schema). This means that the value of a hierarchy attribute must be unique within that level. Otherwise a combination of levels is necessary to form the key of Dij!

Path Normalized Dimension:

In a path normalized dimension (PND), the leaf dimension table Di1 contains all hierarchy attributes h1, ..., ht of the corresponding hierarchy. The key is h1. We additionally assume one hierarchy level table Dik for every hierarchy level k (2 ≤ k ≤ t) that contains the path from the top of the hierarchy down to level k (ht, ht-1, ..., hk). The key of Dik can be the lowest hierarchy level hk or a combination of hierarchy levels. Additionally, a number of feature attributes fk is stored in every Dik (see Figure: Path Normalized Dimension ).

A PND additionally requires foreign key relationships for every hierarchy level table Dik to ensure a correct hierarchy: Dik(ht, ht-1, ..., hk+1) -> Dik+1(ht, ht-1, ..., hk+1) for 2 ≤ k ≤ t-1. Thus, the complete prefix path of levels k+1 to t must also exist in the next hierarchy level table. In order to define the foreign key relationship, we have to introduce corresponding unique indexes or define the primary key of the dimension tables correspondingly.

Figure 1.6. Path Normalized Dimension

Path Normalized Dimension

A PND can express hierarchical relationships between feature attributes and hierarchy levels even if the members of a hierarchy level are not identified uniquely.

In Figure: Sample Path Normalized Dimension , we show a path normalized dimension for the Product dimension of the sample schema. The leaf dimension table Product contains all hierarchy levels Item, Group, and Category and further feature attributes. The higher dimension table Product_Group contains the two hierarchy levels Group and Category and the highest dimension table Product_Cat contains the Category hierarchy level. We define the following foreign key relationships:

  • FACT(Item) -> Product(Item)

  • Product(Group, Category) -> Product_Group(Group, Category)

  • Product_Group(Category) -> Product_Cat(Category)

Figure 1.7. Sample Path Normalized Dimension

Sample Path Normalized Dimension
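
A DDL sketch for this path normalized Product dimension might look as follows (the column types and the feature attributes are assumptions; the composite keys and REFERENCES clauses realize the foreign key relationships listed above):

create table Product_Cat (
  Category char(*) NOT NULL,
  Cat_Desc char(*),                -- feature attribute (assumed)
  PRIMARY KEY (Category)
);

create table Product_Group (
  "Group"    char(*) NOT NULL,     -- quoted because GROUP is an SQL keyword
  Category   char(*) NOT NULL references Product_Cat (Category),
  Group_Desc char(*),              -- feature attribute (assumed)
  PRIMARY KEY ("Group", Category)
);

create table Product (
  Item     char(*) NOT NULL,
  "Group"  char(*) NOT NULL,
  Category char(*) NOT NULL,
  Brand    char(*),                -- feature attribute of the Item level (assumed)
  PRIMARY KEY (Item),
  FOREIGN KEY ("Group", Category) references Product_Group ("Group", Category)
);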

1.2. Data Warehouse Implementation

When implementing a data warehouse system on a database management system, there are a couple of topics to think about. Note that this list is suggested by us and only reflects our understanding of implementing a data warehouse application. We think that an exact analysis is necessary before you implement the logical and physical schema in the database. These analyses include the customers' requirements in terms of the queries they want to execute and the examination of the data existing so far. We think that the schema of a data warehouse is driven by the users' needs (and further plans of the users), because the DBMS must match the users' requirements. In practice, of course, the users' requirements change over time once the users can execute reports (they want more complex reports, ad hoc queries etc.). But it is a mark of quality for the data warehouse designer to predict the users' future wishes and include these predictions in the first data warehouse design, in order to provide a long-running, long-living and scalable data warehouse for the users. In the following we discuss some important tasks the warehouse designer has to examine before and while building the data warehouse:

  • Identifying dimensions and hierarchies

  • Analyzing and identifying queries

  • Analyzing data (distribution, cardinalities, hierarchy cardinalities, etc.)

  • Creating the logical and physical Schema

  • Implementing the ETL process

  • Implementing star queries

The following sections describe each topic in more detail.

1.2.1. Identifying dimensions and hierarchies

Dimensions are reflected in the conceptual schema of a data warehouse application, but are also essential for the physical implementation of such a schema. Since we follow the concept of ROLAP, dimensions are qualifying attributes in the fact table (see Section: Dimensions ). MOLAP implementations consider all attributes in a warehouse schema as dimensions (and thus define a very high number of dimensions, sometimes several hundred or even more). In ROLAP implementations, the number of dimensions usually is between 3 and 10, because dimensions are further classified by hierarchies.

A typical dimension is the TIME dimension holding a hierarchy year - month - day. Further attributes such as weekday, week, month_text, etc. are part of the dimension TIME and do not represent dimensions of their own. This is important for the implementation of a data warehouse with Transbase Hypercube, because dimensions and hierarchies are used to physically cluster the data on disk (see Section: Hierarchy Clustering ) and therefore influence the performance of queries.

Another example for a dimension is a geographical dimension CUSTOMER with a possible hierarchy country - state - city - customer and additional attributes like address, age, profession, etc.

After identifying the dimensions, you should identify all hierarchies for each dimension. Note that a dimension may be organized by several hierarchies. Hierarchies are used for the drilling paths of the report tools of the data warehouse application. Hierarchies represent the semantics of the data warehouse from the users' point of view and therefore should be considered carefully during the design phase! Since hierarchies are used for the physical clustering in Transbase Hypercube warehouse implementations, a reorganization of the hierarchies may result in a database reorganization!

Hierarchies usually have hierarchy level attributes (in our example above: year, month, day) representing the hierarchical relationships, and additionally so-called feature attributes (in our example: weekday, month_text). The feature attributes are additional information for the hierarchy levels, and each of them is assigned to a hierarchy level. For example, weekday is assigned to the hierarchy level day, month_text is assigned to the hierarchy level month, etc. We therefore call the feature attributes dependent on hierarchy level attributes.

For the implementation of the physical schema, choose one hierarchy for each dimension (if there are several hierarchies). This hierarchy should be the most important hierarchy for the users. The definition of "most important for the users" is quite informal, since each warehouse project has its own weighting of importance for users. For example, consider a couple of power users that run important and time-critical reports compared to a large number of users that run less important queries. Then the reports of the power users may be more important and should be supported more efficiently than the queries of the remaining users.

1.2.2. Analyzing the Users' Queries

Usually, a lot of communication with the users is necessary to decide on (and even find) the important queries. Changing a running data warehouse is a very time-consuming task if important queries were not identified during the warehouse design phase! Thus, put enough effort into the specification phase!

When an existing data warehouse is to be replaced by a new data warehouse, it is relatively easy to identify the most important queries, or even to get a more or less complete query profile: Simply log the queries over a representative amount of time and analyze them. It is often helpful to categorize the queries into query classes (e.g., w.r.t. restricted dimensions, hierarchy levels, feature attributes etc.). Assigning users and query frequencies to each query class gives a good idea about the importance of such a query class. Such a query profile also makes it possible to identify important dimensions and hierarchies.

However, do not only consider the existing queries, but also find out which queries will be important in the future. This examination requires talking to the right people, strategic planners etc. Categorize the queries of all consulted users and find common query patterns, in order to define query classes. In this way you also find the important dimensions and hierarchies.

1.2.3. Analyzing Data

For some physical design decisions, knowledge of the data is also important. For example, the specification of the number of siblings for the hierarchies requires an examination of the fanout of the particular hierarchy levels (see Section Extended DDL ). First identify the hierarchy levels and then analyze the cardinality of each hierarchy level considering the existing data. It is important to know the maximum number of children of a hierarchy node. For example, in the TIME hierarchy it is obvious that each year has 12 months. Therefore the maximum fanout (and the maximum number of siblings) for the month hierarchy level is 12 (independently of how many years are stored in the dimension).

The following sample SQL statements return the number of siblings for the hierarchy levels year, month, day, and id (they can also be used for other hierarchies):

year:  SELECT COUNT(*) FROM (SELECT DISTINCT year FROM D);
month: SELECT MAX(c) FROM
         (SELECT COUNT(*) as c FROM
            (SELECT DISTINCT year, month FROM D)
          GROUP BY year);
...
id:    SELECT MAX(c) FROM
         (SELECT COUNT(*) as c FROM
            (SELECT DISTINCT year, month, day, id FROM D)
          GROUP BY year, month, day);
        

The results of these statements reflect the current number of hierarchy siblings. However, dimensions usually grow over time. For example, the number of customers may increase each week. Since the number of siblings is important for the physical clustering, it is also important to consider the growth of the hierarchies for the hierarchy specification in the physical schema. Therefore, you should estimate the future maximum number of siblings per hierarchy level. For example, you may have the CUSTOMER dimension with the hierarchy country(200) - state(100) - city(1000) - customerid(5000), where the numbers in parentheses reflect the current maximum number of siblings (in one city there are at most 5000 customers). While the maximum number of siblings for the upper levels (country, state, city) will not increase very much (for geographical reasons), the maximum number of customers per city may increase over the next 20 years to an estimated 10000. In this case, define the hierarchy with 10000 siblings (or, to be safe, with twice as many, i.e., 20000).

If the number of siblings is chosen too low, a hierarchy overflow error will occur, which requires a reorganization of the star schema in the database and can take a long time. Thus, be careful with these estimations, but do not use excessively high numbers either, because they affect space consumption and query performance.

1.2.4. Creating the logical and physical Schema

When all dimensions and hierarchies are identified and the hierarchy properties are known, the logical and physical schema can be implemented. You can use DDL scripts containing the CREATE TABLE statements, indexes, privileges etc. You can also use data warehouse tools that are suitable for Transbase Hypercube. Note that these tools must support the special Transbase syntax for defining hierarchies, the MHC schema etc. See Section: MHC in Transbase .

1.2.5. Implementing the ETL process

The ETL process is one of the most critical (and time-consuming) parts of a data warehouse implementation. This section only gives a very short introduction to ETL and discusses very few aspects. A lot of literature and commercial tools are available in this field.

ETL stands for Extraction - Transformation - Load. It comprises the import of the data from source systems. Since the quality of data warehouse results depends on the quality of the data stored in the data warehouse, the ETL task must be robust and correct!

The first step is to extract the data from the source systems. The extracted data often comes as flat files used as input for the next steps. But the extraction could also be a connection to host database systems, OLTP systems, etc.

Often, multiple sources are used for the data warehouse. Thus, the format of the source data may vary. For example, the address of customers could be stored in different formats for different source systems. These formats must be unified in such a transformation step. If price data is stored in different currencies, there must be re-calculations to a common currency etc. The rules for the transformations may be very complex and the result must be unified data formats that can be stored in a common data warehouse.

After the transformation step, the data must be loaded into the data warehouse. Mass loading must be supported by the database system, as it is by Transbase Hypercube. The load of the data can be done via flat files (the most common method), via so-called staging areas in the database, or via other methods. See the literature for more information about this topic.

1.2.6. Implementing star queries

The implementation of the star (or snowflake) queries is also an important step. These queries are called (or generated) by reporting tools or are formulated directly by the users. It is often easier to provide some query templates where the users (or tools) can insert restrictions for the dimensions. These query templates have similar formats and once understood are easy to implement.

2. Basics of Transbase Hypercube

Transbase implements the multidimensional Hypercube index as a real multidimensional index. This index provides properties that distinguish it from traditional B-Trees. In the literature, the Hypercube index is known as the UB-Tree.

2.1. UB-Tree

We give just a short introduction to UB-Trees here; details can be found in [Bay97, Ram00]. The basic idea of the UB-Tree is to use a space-filling curve to map a multidimensional universe to one-dimensional space. Using the Z-curve to preserve multidimensional clustering, it is a variant of the zkd-B-Tree. A Z-address a = Z(x) is the ordinal number of the key attributes of a record x on the Z-curve, which can be efficiently computed by bit-interleaving. A standard B-Tree is used to index the records, taking the Z-addresses of the records as keys. The pagination of the B-Tree creates a disjoint partitioning of the multidimensional space into so-called Z-regions. This allows for very efficient processing of multidimensional range queries.

Figure: UB-Tree shows a Z-region partitioning for a two-dimensional universe and the corresponding B-Tree. The interval limits of the Z-regions are also depicted.

Figure 2.1. UB-Tree: Z-Region Partitioning and underlying B-Tree

UB-Tree: Z-Region Partitioning and underlying B-Tree

The processing of the basic operations of the UB-Tree, i.e., insertion, deletion, update, and point query, is analogous to the basic operations of the B-Tree. For each record the corresponding Z-address is computed, and with the resulting value the underlying B-Tree is accessed. Thus, all basic operations require cost proportional to the height of the tree. The only recommended modification to the standard B-Tree algorithms is an adaptation of the split algorithm to achieve a good (as rectangular as possible) Z-region partitioning.

A UB-Tree is especially good at processing multidimensional range queries, as it only retrieves the Z-regions that properly intersect the query box. Consequently, it usually shows the nice property that the response time of range query processing is proportional to the result set size.

2.2. Clustering of Hierarchies

Hierarchies play an important role in various application domains. They are used to provide a semantic structure to data, e.g., a geographical classification of customers in a data warehouse. As the hierarchies cover the application semantics they are used frequently by users to specify the restrictions on the data as well as the level of aggregation. The restrictions on the hierarchies usually result in point or range restrictions on some hierarchy levels. The problem that arises is that these restrictions on upper hierarchy levels lead to a large set of point restrictions on the lowest level, i.e., the level with the most detailed information. This situation is depicted in Figure: Hierarchical Clustering (a): restricting the level 'Product Group' to the value 'VCR' leads to the set of ids {5,8,21} and not to the interval [5,21], as the item with id 11 does not belong to the specified product group.

For most access methods it would be more efficient to process one range restriction instead of a set of point restrictions. The resulting question is how to map a point/range restriction on a higher hierarchy level to a range restriction on the lowest level. To this end, Transbase applies a clustering scheme for hierarchies. A special kind of keys is used for the elements of the lowest level which reflect the hierarchy semantics, i.e., keys which adhere to the partial order defined by the hierarchy levels. These so-called (compound) surrogates guarantee that the keys of all elements in a sub-tree of the hierarchy lie within a closed interval (Figure: Hierarchical Clustering (b)) such that the key of an element not lying in the subtree is not within the interval. In our example, the restriction to the product group 'VCR' now leads to the interval [48,50]; the item with id 11 is mapped to the surrogate 33, which lies outside this interval.

Figure 2.2. Hierarchical Clustering

Hierarchical Clustering

We refer to this technique as hierarchy clustering (HC) from now on. If we combine HC and multidimensional indexing over multiple hierarchy encodings, as it is done in Transbase, then we speak of multidimensional hierarchical clustering (MHC).

In Figure: Hierarchical Clustering we show how the UB-Tree benefits from hierarchy clustering. The figure shows two dimensions (TIME and CUSTOMER), the partitioning of the UB-Tree into Z-regions, and the effect of hierarchy clustering. The query restricts the TIME dimension to "February 2003" and the CUSTOMER dimension to all customers in "Leipzig". Using hierarchy clustering, both restrictions form an interval and result in a two-dimensional query box on the UB-Tree. Because of the good clustering of the UB-Tree, the data read from disk (all red-coloured regions) is almost identical to the data needed to answer the query (only very little additional data is loaded). Without hierarchy clustering, there would be a lot of very small query boxes, because the data in the dimension might be spread over the dimension extensively.

Figure 2.3. Hierarchical Clustering

Hierarchical Clustering

3. MHC in Transbase

In this section, we briefly present the user's perspective when using the OLAP functionality in Transbase.

For our discussions we use a conventional star schema with a fact table consisting of dimension (qualitative) and measure (quantitative) attributes [Kim96]. For the dimensions typically one or more hierarchical classifications based on the dimension attributes (often referred to as features) exist. The primary key of the dimension represents the most detailed level of the dimension hierarchies.

In this manual, we focus on star schemata for ease of description. However, the algorithms are also implemented for general snowflake schemata.

3.1. Sample Schema

Figure 3.1. Sample Schema with HC extensions

Sample Schema with HC extensions

As a running example throughout this manual we use the schema depicted in Figure: Sample Schema. This data warehouse stores sales transactions recorded per item, store, customer, and date. It contains one fact table FACT, which is defined over the dimensions PRODUCT, CUSTOMER, DATE, and LOCATION with the obvious meanings. The measures of FACT are price, quantity, and sales, representing the values for an item bought by a customer at a store on a specific day. The schema of the fact and dimension tables is shown in Figure: Sample Schema and the dimension hierarchies are depicted in Figure: Dimension Hierarchies .

The dimension DATE is organized in three levels: day - month - year. The dimension CUSTOMER is organized in two levels: customer - profession. For each customer the dimension table contains an ID, a name, an address, and a profession. The dimension has two hierarchical attributes (person_id, profession) and two feature attributes (name, address). The LOCATION dimension is organized by four levels: store - city - region - country. Stores are grouped into cities, these are grouped into regions and the regions finally are grouped into countries. For each city, the population is stored as feature attribute. The dimension has four hierarchical attributes (store, city, region, country) and one feature attribute (population) that is assigned to the city level.

Finally, the PRODUCT dimension is organized into three levels: item - group - category. Items are grouped into product groups and those are further grouped into categories (e.g., "air condition"). The attribute brand characterizing each item is a feature attribute.

Star queries are written in standard SQL, i.e., the join attributes between the fact and dimension tables are the dimension key attributes:

Example 3.1. Star Query

SELECT
  SUM(sales)
FROM
  FACT F, DATE D, PRODUCT P , LOCATION L
WHERE
  D.day=F.day AND P.item=F.product AND
  L.store=F.store AND D.year=2002 AND
  P.category = 'Air Condition' AND
  L.population > 1000000
GROUP BY
  D.year, D.month
      

This query returns the sum of sales for the year 2002 for air conditions in cities with a population larger than one million.

Figure 3.2. Dimension Hierarchies

Dimension Hierarchies

3.2. The User's View: Hierarchy Specification by extended DDL

To allow the user to define such a schema, the DDL has been extended to express hierarchies and the desired physical organization of the fact table according to the dimension hierarchies. This is achieved by specifying an additional field per dimension table per hierarchy that basically represents the compound surrogate derived by HC as described earlier. The keyword SURROGATE is used to mark the definition of a surrogate field as opposed to the definition of a standard user-visible field. In the dimension table, we denote a compound surrogate by the keyword COMPOUND. The hierarchy levels are specified after this keyword by enumerating the hierarchy levels from top to bottom. The maximum fan-out of each hierarchy level is specified by the keyword SIBLINGS (which determines the number of bits reserved for the hierarchy level) after the corresponding hierarchy level. The following DDL statement shows the CREATE TABLE statement for the Location dimension:

Example 3.2. CREATE Statement with SURROGATE Clause

create table Location (
  country char(*) NOT NULL,
  region char(*) NOT NULL,
  city char(*) NOT NULL,
  store char(*) NOT NULL,
  population integer,
  SURROGATE cs_location COMPOUND (
   country SIBLINGS 10, region SIBLINGS 50,
   city SIBLINGS 50, store SIBLINGS 1000),
  PRIMARY KEY (store)
)
      

The surrogate specification defines the hierarchy depicted in Figure: Sample Hierarchies . The SIBLINGS information specifies that in this hierarchy there are at most 10 countries, at most 50 regions per country, at most 50 cities per region, and at most 1000 stores per city. Thus, the Location dimension may contain at most 10 * 50 * 50 * 1000 = 25,000,000 members.
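
The dimension tables referenced by the fact table below need analogous surrogate definitions. As a further illustration, a possible sketch for the Date dimension is shown here; the SIBLINGS values are assumptions chosen for this example (each year has at most 12 months and each month at most 31 days, while the number of years is estimated generously):

create table Date (
  year  integer NOT NULL,
  month integer NOT NULL,
  day   integer NOT NULL,
  SURROGATE cs_date COMPOUND (
    year SIBLINGS 100, month SIBLINGS 12, day SIBLINGS 31),
  PRIMARY KEY (day)
)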

For the fact table specification, we now have to specify the physical organization, besides the standard relationships to the dimension tables. As we want to cluster the data according to the hierarchies of the dimension, we use the surrogates instead of the "logical" keys. To this end, we introduce the concept of reference surrogates. Technically, a reference surrogate is an additional system maintained field in the fact table. Because of the necessary foreign key constraints in the fact table, it is possible to decide to which dimension the reference surrogate belongs. We use again the keyword SURROGATE to denote a surrogate. Then we use the keyword FOR, in order to assign the reference surrogate to a dimension:

Example 3.3. CREATE Statement with SURROGATE ... FOR Clause

create table fact (
  product char(*) NOT NULL
       references product (item) ON UPDATE CASCADE,
  store char(*) NOT NULL
       references location(store) ON UPDATE CASCADE,
  time integer NOT NULL
       references date(day) ON UPDATE CASCADE,
  sales numeric(10,3),
  price numeric(10,3),
  quantity numeric(10,3),
  SURROGATE cs_prod FOR product,
  SURROGATE cs_store FOR store,
  SURROGATE cs_time FOR time,
  PRIMARY HCKEY (cs_prod, cs_store, cs_time)
)
      

A different keyword for the key specification, PRIMARY HCKEY, is used to specify the index access method, namely a UB-Tree (HC stands for Hypercube, as the UB-Tree is called in Transbase).

This statement also shows that the definition of reference constraints (foreign key constraints) with the ON UPDATE CASCADE clause to all dimension tables is necessary. This is for two reasons: First, the reference constraints are used for the reference surrogate lookup. Second, there must be an update trigger if a record in the dimension table changes some field values and the hierarchy (and therefore the compound surrogate) is changed. In this case, the new value for the compound surrogate must be propagated to all records of the fact table with the same dimension key, resulting in a re-clustering of these records. You can also specify the ON DELETE CASCADE clause if you need it. Otherwise, deleting records in the dimension table which are referenced by the fact table will not be allowed.

All further statements (especially INSERT and UPDATE) may (and must) ignore the additional fields. This is comparable to a secondary index which is created by the user but then becomes a system maintained part of the database.
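
For example, an insert into the fact table of the sample schema lists only the user-visible fields; the reference surrogates cs_prod, cs_store, and cs_time are computed and added by the system (the inserted values are purely illustrative):

INSERT INTO fact (product, store, time, sales, price, quantity)
VALUES ('Item_4711', 'Store_42', 20020115, 199.99, 99.99, 2)

Note that, because of the reference constraints, matching dimension records for the keys 'Item_4711', 'Store_42', and 20020115 must already exist in the dimension tables.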

It is important to note that all fields created by the SURROGATE specification are system maintained and should not be modified or set by the user. The fields, however, are visible for example when calling the rse command in the tbi. This command shows the CREATE TABLE statement for the corresponding table (among further statements):

Example 3.4. Output: rse Command

rse Location
create table "LOCATION" with ikaccess
( "COUNTRY"  char(*) not null,
  "REGION"  char(*) not null,
  "CITY"  char(*) not null,
  "STORE"  char(*) not null,
  "POPULATION" integer,
  SURROGATE "CS_LOCATION" COMPOUND ( "COUNTRY" SIBLINGS 10,
  "REGION" SIBLINGS 50, "CITY" SIBLINGS 50, "STORE" SIBLINGS 1000),
)  key is "STORE"
;
      

As the example shows, the user can very naturally specify the physical organization of the data according to the schema semantics. In the following sections we will discuss the internals of the system transparent to the user.

4. Implementing HC in Transbase

In this section we describe some internals of the implementation of hierarchical clustering, because we think it is helpful for understanding the schema design and the internal algorithms.

For the implementation of HC in Transbase various issues have to be solved. The most important one is the internal representation and management of the surrogates as well as the schema extensions necessary for automatic processing. Before we continue, we want to introduce some terminology that is frequently used in the remainder of this manual. We refer to the fact table as FT and to the dimension tables as Di. We use FT.mk to denote a measure attribute of the fact table, Di.hj to denote a hierarchical attribute (h1 denotes the leaf level of the hierarchy, i.e., it is the key of the dimension table), and Di.fk to denote a feature attribute of the dimension. PRED and AGGR are placeholders for any predicate or aggregation function, respectively, on the specified attribute.

4.1. Internal Representation of Compound Surrogates

As already indicated in Figure: Sample Schema, the SURROGATE specification in the "create table" statement leads to the creation of an extra field of type BITS2 in the dimension or fact table containing the encoding (surrogate key, "hsk") of the corresponding hierarchy. It is unique for each dimension record as each leaf member of the hierarchy is assigned a unique value by HC.

The hsks are also contained in the fact table, as there they are required for the physical clustering of the data. As the logical dimension keys and the hsks are substitutes for each other, one could save a lot of space in the fact table by suppressing the logical keys and keeping just the hsks, where available for a dimension. However, this may lead to a drastic performance decrease in cases where the original keys of the fact table records are accessed in the query, because it would require expensive joins with the dimension tables. The suppression of logical keys is therefore not implemented in Transbase.

Internally, compound surrogates are fixed length bit strings. The length corresponds to the siblings specification of all surrogate components.
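
Although system maintained, the surrogate field can be read like any other column, for example to inspect the encoding of particular dimension members (the restriction value is illustrative):

SELECT store, city, cs_location
FROM Location
WHERE city = 'Munich'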

4.2. System Catalog Extension

In order to implement hierarchical clustering the system needs to have knowledge about the defined surrogates. The system catalog extension is designed to allow several compound surrogates (and thus hierarchies) within one table.

4.3. Automatic Index Creation

For efficient query retrieval and surrogate maintenance, we need two special secondary indexes on the dimension table. Let cs be the field name of the compound surrogate and ht,..,h1 be the list of field names of the levels for the compound surrogate definition. Transbase automatically creates the following indexes (here described by the following virtual SQL CREATE statements):

CREATE INDEX "@@sys_surrCSX_<surrid>" ON <dim_table> (cs);
CREATE INDEX "@@sys_surrHX_<surrid>" ON <dim_table> (ht,..,h1,cs);
      

Thus, the index names consist of two components: a prefix that marks the index as a system index, and a generated suffix consisting of the kind of the index (CSX or HX) and the surrogate id of the corresponding surrogate. The indexes are needed for the computation of compound surrogates, for the lookup of reference surrogates (see section fact table insert ) and for query processing. Of course, these indexes cannot be dropped by the user.

When creating your own indexes on a dimension table, the compound surrogate should also be a component of the index, because it can improve the performance of the evaluation of dimension predicates.
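
For example, a user-defined index on the population feature attribute of the Location dimension could include the compound surrogate cs_location as an additional component (the index name is arbitrary):

CREATE INDEX loc_population_ix ON Location (population, cs_location)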

4.4. Multiple Hierarchies

Figure 4.1. Customer Dimension with two Hierarchies

Customer Dimension with two Hierarchies

A dimension of a data warehouse may include several independent hierarchies. Because of the representation of hierarchies by compound surrogates, we have to deal with several compound surrogates, one for every possible hierarchy. An example of several hierarchies is shown in Figure: Multiple Hierarchies. The customer dimension has a geographical hierarchy with country - region - town - customer and an organizational hierarchy with profession - customer.

We have to distinguish between two levels of hierarchies. The conceptual level defines hierarchies on the conceptual data warehouse schema. Depending on the data warehouse application, multiple hierarchies may be defined on all dimensions representing the application data model. Usually hierarchies represent drill and aggregation paths for user queries.

The physical level of hierarchies is responsible for the clustering properties of the hierarchies, in combination with hierarchies of other dimensions, i.e., the complete MHC schema. The number of clustering hierarchies is restricted due to the properties of the clustering multidimensional index. It is recommended to use one hierarchy per dimension for clustering. If several hierarchies are frequently used for dimension restrictions, the developer should choose one of them for physical clustering. One dimension table may serve several fact tables that use different hierarchies for clustering.

The internal structures allow an arbitrary number of hierarchies to be established, represented by compound surrogates. However, we require that the leaf level of all hierarchies is the same (a so-called shared leaf level), usually the primary key of the dimension table. This hierarchy property is checked when creating the table and the compound surrogates in the DDL statement. With multiple hierarchies allowed on one dimension table, we can use the dimension table for several fact tables that can be clustered w.r.t. different hierarchies. Thus we avoid redundancy problems of replicated dimension tables.

Every compound surrogate is assigned a unique id. This so-called surrid is necessary to assign the reference surrogates to the corresponding compound surrogate. One fact table includes an arbitrary number of reference surrogates, specified by the surrid of the corresponding compound surrogate of the dimension tables. Thus, we can use reference surrogates of several hierarchies of one dimension table within one fact table. These reference surrogates may be used as index key attributes and thus for clustering the fact table according to several hierarchies.

5. Maintenance of MHC

After discussing the internal representation of the surrogates, we now turn to the automatic maintenance of the hierarchical clustering. We start with the insertion of new dimension members and continue with the insertion of fact table records.

5.1. Computation of Compound Surrogates

The computation of a compound surrogate hsk occurs when a record is inserted into the dimension table. Updates of hierarchy fields may also lead to a re-computation of hsk. Figure: MHC Maintenance schematically shows the insertion of a new record for our example product hierarchy. We assume the values vi for each hierarchical field hi. Considering the hierarchy, the insertion of a record means the creation of a new path (vk,...,v1) in the dimension D.

Figure 5.1. Insertion of new records into a dimension

Insertion of new records into a dimension

For the computation of the compound surrogate hsk we have to check whether a prefix of the new path already exists in D. For a record to be inserted, we call the already existing part of the new path the matching prefix path (MPP). The MPP may be empty as in Figure: MHC Maintenance (a) - in this case a new root element, here "DVD", has to be created and the forest grows by one tree.

A non-empty MPP (see Figure: MHC Maintenance (b)) comprises the levels (ht,...,hk) for some k with k > 1 and k ≤ t. The number k is then called the match level of the new record's path (2 in our example). At the next lower level, i.e., the first non-matching level, the surrogate for the new value is determined. Usually, the maximum surrogate is incremented by one, but one may also use different schemes to compute the surrogate, for example if one wants to reuse the surrogates of deleted elements. According to the SURROGATE definition the single surrogate values are concatenated to build the compound surrogate hsk.

5.2. Insertion into the Fact Table

An insertion of a record into the fact table specifies the key dimension attributes (the dimension attributes of the leaf hierarchy level). For each dimension, however, the corresponding compound surrogate must be found.

This so-called lookup must be performed before the insertion of the record into the fact table, because the record has to be extended by the corresponding compound surrogates of the dimensions (reference surrogates). For every dimension di, a direct access to the dimension table is performed to retrieve the corresponding cs.

Depending on the number of dimensions, a number of B*-Tree accesses is necessary, which decreases the insert performance. Especially for bulk loading, such a procedure may cause long insertion times. In order to improve mass loading, an alternative lookup concept has been implemented in Transbase: a hash table with the dimension keys and the corresponding compound surrogates is created for every dimension before inserting the fact table records. Then the lookup is very efficient, because only main memory accesses to the hash tables are necessary.

5.3. Hierarchy Reorganization

Usually, dimension hierarchies are stable. However, there are various application domains, where frequent hierarchy reorganizations, e.g., moving of sub-trees, occur and should be reflected in the data organization. Such operations require additional support by the DBMS as local operations on dimension tables now have immediate influence on the organization of the fact table. These operations are usually much more expensive. For example, reclassifying one product group to a different category will cause changes in the hsks. In order to have a correct physical clustering of the fact table all fact records corresponding to the changed product group have to be updated. Of course, arbitrary hierarchy reorganizations by data updates are supported, but it has to be kept in mind that the necessary fact table updates may be very time critical.

6. Query Processing with MHC

In this section we discuss the basic star query processing for DW schemata with MHC and multidimensional index organization and its advantages. For the discussion in this manual, we restrict ourselves to standard "star queries".

6.1. Query Template

The following pseudo SQL example shows the SQL query template for simple ad hoc star queries. The notation {X} represents a set of X objects. The template defines typical star queries and uses abstract terms that act as placeholders. Note that queries conforming to this template generally have a structure that is a subset of the template and instantiate all abstract terms.

SELECT  {D.h},{D.f},{FT.mi},
  {AGGR(FT.mj) AS AM}, {AGGR(D.h) AS AH},
  {AGGR(D.f) AS AF}
FROM  FT,{D1},...,{Dn}
WHERE  FT.d1 = D1.h1 AND
  FT.d2 = D2.h1  AND
  ...
  FT.dn = Dn.h1  AND

  PRED(D1)  AND
  PRED(D2)  AND
  ...
  PRED(Dn)  AND
  PRED({FT.m})
GROUP BY  {D.h},{D.f},{FT.m}
HAVING  PRED({AM},{AH},{AF})
ORDER BY  <ordering fields>
      

Our template is applied to a schema similar to the one in Figure: Sample Schema , which is a typical star schema. It specifies the restrictions on the various dimensions, the star join between the fact table and the required dimension tables, and the subsequent grouping and aggregation. In general any attribute (hierarchical, feature, or measure) can appear in a GROUP BY clause. However, most queries group the result by a number of hierarchical and/or feature attributes. Finally, there is an ORDER BY clause to control the order of the presented results.

6.2. Star Query Processing with MHC

Processing of star queries as described above can roughly be divided into three major steps:

  1. Evaluation of dimension predicates

  2. Fetching result records from the fact table

  3. Residual joins, grouping and aggregation, sorting

Figure 6.1. Standard Abstract Execution Plan

Standard Abstract Execution Plan

The first step evaluates the predicates on the dimension tables. The second and third step are often considered together as not all restrictions can be evaluated on the fact table only. In the following, we speak of two processing phases:

  1. Predicate Evaluation

  2. Main Execution Phase

In the presence of MHC and a multidimensional index for the organization of the fact table new optimizations can be applied to these phases. This is reflected already in the abstract execution plan (AEP) of Figure: Standard AEP .

6.3. Predicate Evaluation and Fact Table Access

In the predicate evaluation phase the first benefit of HC can be observed. Instead of generating a large set of point restrictions, the local predicates on the dimensions usually result in a (small) set of range restrictions. This is especially true for predicates with hierarchical restrictions (cf. Figure: Hierarchical Clustering (b)) but also many feature restrictions will lead to ranges if there is some dependency between the defined hierarchy and the feature values.

The intervals are generated efficiently by using the automatically generated system index "@@sys_surrHX_<surrid>". For example, if the highest hierarchy level (or a so-called prefix path from the top of the hierarchy down to some level hk) is restricted, the index can be used to determine the lower and upper bound of the interval defined by the predicate. Only one index access is necessary, because the compound surrogate is implemented as a bit string: the bits of the specified hierarchy levels are padded with "00...00" to form the lower bound of the interval and with "11...11" to form the upper bound.

If no prefix path of the hierarchy is specified, all dimension records satisfying the predicate must be evaluated. Their compound surrogates are collected, and a special optimization step derives the minimum number of intervals covering these compound surrogates. This optimization also works for feature restrictions: if the feature restrictions align well with hierarchy levels, the number of intervals will also be very small.
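
The difference between the two cases can be sketched with two restrictions on a hypothetical DATE dimension with the hierarchy year - month - day and a feature attribute day_of_week (all names are illustrative only):

-- Hierarchical prefix restriction: the compound surrogates of the
-- qualifying dimension members form one contiguous interval.
SELECT SUM(f.sales)
FROM   fact f, date d
WHERE  f.time = d.day AND d.year = 2001;

-- Feature restriction without a hierarchy prefix: the qualifying days are
-- scattered over the hierarchy, so the interval optimization typically
-- produces many small intervals.
SELECT SUM(f.sales)
FROM   fact f, date d
WHERE  f.time = d.day AND d.day_of_week IN (2, 4, 6);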

After the generation of the intervals per dimension, the combination of intervals is transformed into a number of query boxes that are executed by the Fact Table Access. At that point, the multidimensional organization of the fact table yields a major cost saving: relying on the clustering and the efficient range query algorithm of the UB-Tree, the number of I/O operations needed to fetch the required data is drastically reduced.

A small number of intervals results in a small number of multidimensional query boxes for the Hypercube access, which can be processed efficiently. A high number of intervals, however, causes overhead during query processing. Note that the number of query boxes is the product of the numbers of intervals over all restricted dimensions and can grow very large!

6.4. Main Execution Phase: Grouping and Aggregation

The Main Execution Phase joins the records fetched from the fact table with all necessary dimension tables (Residual Join operator). After the residual join, the records are grouped (Group-Select operator), filtered again (Having operator) and sorted (OrderBy operator). This phase strongly benefits from the additional information encoded in the fact table records via the reference surrogates. The reference surrogates are an encoding of the hierarchy path of each level member, so important optimizations can be realized for GROUPing on dimension fields. In particular, in many cases a great part of the grouping work can be done locally on the fact table records using prefixes of the reference surrogates.

The important point is that this pre-grouping step is done before the fact table records are joined with the dimension tables. As the grouping operation typically reduces the record cardinality considerably, the cost of the subsequent join operation is also greatly reduced. This technique is called Hierarchical Pre-grouping. In detail, if the GROUPing field belongs to hierarchy level hk or is functionally dependent on it, then the h-surrogate prefix ht/ht-1/.../hk is dynamically computed and the GROUP operation is done on that value.

A simple example illustrates the effect of the hierarchical pre-grouping method: assume a DW with a time dimension (besides other dimensions) categorized by year - month - day and a well-populated fact table. A query restricting the result to the year 2001 (besides other restrictions) qualifies 100.000 fact table records. If the result has to be grouped by month, we would have to join 100.000 records with the time dimension table. When applying hierarchical pre-grouping, the number of join operations is reduced by a factor of 30, because all days of one month are aggregated to that month before the join.
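
Purely as a conceptual illustration (the engine performs this internally on reference surrogate prefixes, not via SQL), the effect of pre-grouping before the residual join can be mimicked with a nested query; the table and attribute names are hypothetical, and for brevity the year restriction is applied after the pre-grouping here, whereas the engine restricts the fact table first:

-- Pre-aggregate the fact records on the dimension key before joining the
-- dimension table; the join then processes one pre-aggregated row per day
-- instead of every single fact record.
SELECT   d.month, SUM(pre.sales)
FROM     ( SELECT time, SUM(sales) AS sales
           FROM   fact
           GROUP  BY time ) pre, date d
WHERE    pre.time = d.day
AND      d.year   = 2001
GROUP BY d.month;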

Figure 6.2. Abstract Execution Plan with Pre-Grouping

Abstract Execution Plan with Pre-Grouping

Figure: AEP Pregroup shows the final abstract execution plan used in Transbase . Due to the effective pre-grouping steps, the costly overall residual join step is avoided if possible.

For detailed information about hierarchical grouping please refer to [Pie03].

7. Practical Issues

This chapter contains some issues that should be considered for working with data warehouses and Transbase Hypercube. In particular, we discuss the following topics:

  • Load of data

  • Maintaining the data warehouse

  • Analyzing queries

  • Set database parameters

The first step is always to design a database schema. Depending on the application, the schema must fulfill the user's requirements. We distinguish between the conceptual, logical, and physical schema. After the schema is fixed, the warehouse can be built by creating the database and executing the corresponding SQL statements (tables and indexes, views, users, privileges, etc.). Usually, a lot of data is loaded into the data warehouse at the beginning (initial load). Then data is loaded periodically (and incrementally) into the warehouse, e.g., daily or weekly. Applications must be built to access the data in the warehouse, for example reporting tools, customer applications, etc. From time to time, some maintenance operations are necessary for efficient access to the warehouse. All these steps are described in more detail in the following sections.

In this chapter we show how to load data into a Transbase Hypercube warehouse (fact table, dimension tables). We further discuss some maintenance aspects for Transbase data warehouses like backup and recovery, deleting old data etc. We show a tool that can display some warehouse schema information, give some hints about database parameters, and give some rules of thumb when working with Transbase and data warehouses. A main part of this chapter describes how to analyze queries and understand Transbase operator trees for MHC star queries.

7.1. Practical Schema Design

Transbase Hypercube provides good clustering for up to 6 dimensions. In practice, data warehouse implementations sometimes require more dimensions. In such a case, one has to decide which of the n dimensions to use for clustering (at most 6) and which not to consider for clustering. We call clustering dimensions primary dimensions and non-clustering dimensions secondary dimensions. Primary dimensions occur in the PRIMARY HCKEY clause and form the primary hypercube key of the table. In Transbase , the concepts of primary key and clustering B*-Tree index key are not separated. Thus, for the fact table, the PRIMARY HCKEY consisting of the primary dimensions is equal to the primary key of the table. This sometimes causes misunderstandings.

Usually, the data in a table is unique w.r.t. its primary key. For fact tables, this key is the combination of all dimensions. When choosing only a subset of the dimensions as primary dimensions, the primary key of the table consists of just this subset and in most cases is no longer unique. Therefore, Transbase provides so-called NOT UNIQUE primary keys. The following SQL statement defines two dimensions as primary dimensions and one dimension as secondary dimension:

create table fact (
  product char(*) NOT NULL
       references product (item) ON UPDATE CASCADE,
  store char(*) NOT NULL
       references location(store) ON UPDATE CASCADE,
  time integer NOT NULL
       references date(day) ON UPDATE CASCADE,
  sales numeric(10,3),
  price numeric(10,3),
  quantity numeric(10,3),
  SURROGATE cs_prod FOR product,
  SURROGATE cs_store FOR store,
  SURROGATE cs_time FOR time,
  PRIMARY HCKEY NOT UNIQUE (cs_prod, cs_time)
)
      

The dimensions PRODUCT and DATE are primary dimensions, the dimension LOCATION is a secondary dimension. The definition of primary and secondary dimensions impacts the star query execution, because predicates on secondary dimensions cannot be used to efficiently restrict the multidimensional query boxes on the Hypercube index. However, there are some optimization techniques regarding the usage of secondary dimensions which are not explained here in detail.
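
As a hedged illustration (the REGION attribute of LOCATION is hypothetical), a query that restricts only the secondary dimension cannot narrow the query boxes on the primary hypercube key (cs_prod, cs_time), so the fact table access has to read far more pages than an equivalent restriction on PRODUCT or DATE:

-- Restriction on the secondary dimension LOCATION only; the predicate is
-- evaluated on the fetched fact records, not via the Hypercube key.
SELECT SUM(f.sales)
FROM   fact f, location l
WHERE  f.store = l.store
AND    l.region = 'South';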

7.2. Loading a Data Warehouse

Fast loading of data plays an important role in a data warehouse. Usually there is one initial load at the beginning of a "warehouse life" and a periodical incremental load of data (e.g., daily, weekly, monthly). Loading data into Transbase can be done via INSERT or SPOOL statements.

The INSERT statement inserts records into a table:

INSERT INTO T VALUES(1, 'abc');
      

This inserts the record (1, 'abc') into table T. Since each INSERT statement inserts one single record, a large number of such statements (which can be parameterized) is required to load the tables. The performance may degrade because of the high communication overhead.

The alternative is to use the SPOOL statement (see also Transbase SQL manual) that loads a large amount of data stored in flat files:

SPOOL T FROM /tmp/t_data.spf;
      

This statement inserts all records stored in the file '/tmp/t_data.spf'. The file is read completely before inserting the data; therefore, all data is handled by one single statement, which reduces the communication costs correspondingly. The records are sorted w.r.t. the primary clustering key and then inserted into the corresponding table. With this method, the insert performance is much better than with single inserts, and it should be used for loading large fact tables and dimension tables. Please note that no intermediate COMMIT is executed; a COMMIT has to be issued explicitly after the statement.
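
A minimal sketch of such a bulk load with an explicit COMMIT, reusing the table and file name from the example above:

-- Bulk load of table T from a flat file, followed by an explicit COMMIT:
SPOOL T FROM /tmp/t_data.spf;
COMMIT;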

For data warehouse applications, some additional information on how to handle INSERT and SPOOL is necessary; it is discussed in the following.

7.2.1. Dimension Tables

The dimension tables usually contain one or more compound surrogates. These surrogate fields are computed and maintained automatically, so explicit values must not be supplied in INSERT or SPOOL statements. Instead, either omit the fields or specify a NULL value for them.

For INSERT, you can either specify a field list and omit the surrogate fields:

INSERT INTO dim_date(year, month, day, id) VALUES (2004, 10, 20, 734);
        

Alternatively you can explicitly specify a NULL value for the surrogate fields:

INSERT INTO dim_date VALUES (2004, 10, 20, 734, NULL);
        

In both cases, the surrogate is computed according to the hierarchy definition of the dimension dim_date. Note that if you specify an explicit value instead, it is taken without any validation. In such a case, the hierarchy information may become inconsistent and queries may return wrong results.

It is also possible to use the SQL statement INSERT INTO ... SELECT FROM ... . In this case, you must also specify the field list or use an explicit NULL value for the surrogate field:

INSERT INTO dim_date(year, month, day, id)
  SELECT year, month, day, id FROM ...
or
INSERT INTO dim_date SELECT year, month, day, id, NULL FROM ...
        

The same holds for the SPOOL statement. However, it is not possible to specify a field list in a SPOOL statement, so you have to supply a NULL value for the compound surrogates. In spool flat files, the NULL value is represented by a question mark ('?'). The following SPOOL statement and flat file content serve as an example:

SPOOL dim_date from flat.dat delim is ','
2003,8,12,12345,?
2003,8,13,12346,?
2003,8,14,12347,?
        

For dimension tables, where the compound surrogates are the last fields of the table, you can omit the NULL value in the spool file:

SPOOL dim_date from flat.dat delim is ','
2003,8,12,12345
2003,8,13,12346
2003,8,14,12347
        

It is also possible to spool data into a view that for example does not include the surrogate field. In this case, the view must be updatable. Usually such a view contains all fields of the dimension table except for the compound surrogates.
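
A hedged sketch of such a view follows; the view name dim_date_load is hypothetical, and it is assumed that a plain projection of the dimension table is updatable in this sense:

-- View exposing all columns of dim_date except the compound surrogate:
CREATE VIEW dim_date_load AS
  SELECT year, month, day, id FROM dim_date;

-- Spool files for this view need no '?' placeholder for the surrogate:
SPOOL dim_date_load FROM flat.dat delim is ',';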

7.2.2. Fact Tables

Loading the fact table usually is more time-critical than loading the dimension tables, because the amount of data is larger by orders of magnitude. An initial load may insert several million records (or even more). For the INSERT statement, the same rules apply as for dimension tables: you also have to specify NULL values for the reference surrogates. We do not show the INSERT statements here because they are analogous to Section Loading Dimension Tables .

For the SPOOL statement, the reference surrogates are specified in the same way as the compound surrogates of dimension tables. However, further options should be used in the SPOOL statement to additionally speed up the insert. One is the option LOCAL, which can be used if the (usually large) spool file is stored locally on the server, so no copy from the client to the server is required. Note that the complete path of the spool file must be specified in this case:

SPOOL FACT FROM /tmp/fact.spf LOCAL;
        

The second point is that the sorter cache must be large enough to enable the special MHC reference surrogate optimization (see Section Database Properties for more details).

Usually the spool files are very large and may exceed 2 GB. On most platforms this does not cause any problems for Transbase . A differential data load usually contains much less than this amount of data. Note that Transbase Hypercube can also load data continuously (during online phases) while other users run reports and queries on the data warehouse (online or real-time data warehouse).

7.3. Maintaining a Data Warehouse

There are some operations that require a more detailed description for data warehouse applications and Transbase Hypercube implementations. Some of them, such as backup and recovery, are also described in the standard Transbase manuals, but they are mentioned here because the database environment differs from other database applications.

In particular, we discuss the following topics:

  • Database availability

  • Backup

  • Parallel reading and writing

  • Deleting old data

7.3.1. Database Availability

Usually a data warehouse contains historic data and is used to run analyses on these data. However, sometimes daily business processes depend on these analyses, and the data warehouse must be available 24 hours a day, 7 days a week. This 24x7 availability is supported by Transbase via so-called standby concepts (physical or logical replication).

The logical standby or replication concept is based on the comparatively simple approach that two databases exist on different servers and all data operations (INSERT, UPDATE, DELETE) are performed on both of them. The databases are called active and shadow database: the active database is accessed by the users, and the shadow database becomes the active database when a database crash or system failure occurs. For a logical standby concept, the ETL process must be implemented to fill both databases. In this case, you can also switch between the databases when loading data, so that the data is first loaded into the shadow database. After the load phase, the shadow database becomes the active database, and the former active database becomes the shadow database and is loaded analogously. In such a scenario, the users are not affected by the load operation.

The physical standby concept uses standard database mechanisms for the implementation of a standby database. In Transbase there are basically two different logging concepts: before image logging and delta logging. Delta logging writes log files for each logical database operation. These log files are permanent and can be used to run the same operations on the shadow database by simply applying the log files to the database. This can be automated easily and provides a robust standby database concept. For more information about this mechanism please refer to the corresponding Transbase manuals.

7.3.2. Backup

Since the data volume in a data warehouse usually is very large, regular database backups are necessary in practical implementations. However, consider the space requirements of a full backup. Transbase supports both full and incremental backups. We recommend running a full backup at longer intervals (e.g., monthly or weekly) and a differential backup each day.

The full backup can be executed online (other users can work on the database during the backup process). Be sure that enough disk space (or tape) is available for the full backup!

The differential backup is also supported by Transbase and can be processed online, too. Transbase supports automatic organization of the differential backups (corresponding directory structure) and automatic recovery in the case of a system failure. See the system guide manual for more information about backup and recovery.

7.3.3. Parallel Reading and Writing

When data is loaded into the warehouse during online times, i.e., while users run queries on the database, special care must be taken regarding parallel reading and writing. Generally, multiple readers can access the same table without blocking. However, with the standard consistency (isolation) level 3, long read transactions prevent write transactions from committing (the last reader must have committed before the write transaction can commit). See the System Guide for more information about the locking protocols. Generally, a large write operation consumes a lot of system resources and therefore affects the readers. Standard Transbase applications run with the highest consistency mode, i.e., the highest so-called consistency level is applied to the transactions of the read and write queries. This consistency mode can lock table or index pages and reduce the parallelism of reading and writing operations. In such scenarios, the consistency level can be lowered in order to increase parallelism. See the corresponding interface documentation on how to define the consistency levels. Especially for a scenario where data is loaded into the fact table frequently (e.g., hourly, generally during read access), setting the consistency level to CONSISTENCY 1 allows for parallel reading and writing: a write transaction can commit without waiting for the end of all read transactions. Please note that in such a case a reader can see inconsistent transactional data if the new data is part of the query result and the server cursor accesses a table page with newly inserted data (no cursor consistency!).

7.3.4. Deleting old Data

Most data warehouses are implemented to only load data, not to delete unused data. However, the size of such data warehouses often grows considerably, and data eventually has to be deleted (e.g., to keep backups manageable). Deleting data from a table is done with the SQL statement DELETE. Depending on the amount of data in the fact table, such a delete operation can take a very long time, because parts of the table may be reorganized (due to B*-Tree effects). Since the physical clustering provided by the Hypercube index is often quite good, deleting fact table records of a specific time period (which is often the predicate of a delete operation) may cause only minor reorganization. Nevertheless, the operation can take a long time.
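
A minimal sketch of such a time-based deletion, using the sample schema introduced in the next section; whether the DELETE should be split into several smaller transactions depends on the data volume and the logging mode:

-- Delete all fact records belonging to the year 1999:
DELETE FROM FACT
WHERE  CAL_DWH_KEY IN
       ( SELECT DWH_KEY FROM CALENDAR WHERE YEAR_NUM = '1999' );
COMMIT;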

7.4. Analyzing the Data Warehouse

Almost every data warehouse application develops performance problems after some time. In such cases, the data warehouse must be analyzed. Since most operations are read operations (star queries), these queries must be analyzed precisely to find the reasons for the performance problems.

The main reasons for performance problems are

  • Bad clustering

  • Many query boxes

  • Badly optimized queries

These items are explained in more detail in the following. For our explanations we use the schema given below:

-- DIMENSION TABLE PRODUCT
CREATE TABLE PRODUCT
(
  DWH_KEY  INTEGER NOT NULL,
  BRAND_CODE  VARCHAR(3),
  BRAND_LONG  VARCHAR(30),
  PRODUCTGROUP_CODE  VARCHAR(14) NOT NULL,
  PRODUCTGROUP_LONG  VARCHAR(30),
  CATEGORY_CODE VARCHAR(14) NOT NULL,
  CATEGORY_LONG VARCHAR(30),
  SURROGATE CS COMPOUND ( CATEGORY_CODE SIBLINGS 24,
    PRODUCTGROUP_CODE SIBLINGS 44, DWH_KEY SIBLINGS 1250)
)  KEY IS DWH_KEY
;

-- DIMENSION TABLE CALENDAR
CREATE TABLE CALENDAR
(
  DWH_KEY INTEGER NOT NULL,
  CALENDAR_DATE DATETIME[YY:DD],
  DAY_OF_WEEK CHAR(1),
  WEEK_NUM_IN_YEAR INTEGER,
  MONTH_NUM CHAR(2),
  MONTH_YEAR CHAR(7) NOT NULL,
  QUARTER_NUM CHAR(2),
  QUARTER_YEAR CHAR(6) NOT NULL,
  HALF_YEAR CHAR(1),
  HALF_YEAR_YEAR CHAR(6) NOT NULL,
  YEAR_NUM CHAR(4) NOT NULL,
  SURROGATE CS COMPOUND ( YEAR_NUM SIBLINGS 8,
    HALF_YEAR_YEAR SIBLINGS 2, QUARTER_YEAR SIBLINGS 2,
    MONTH_YEAR SIBLINGS 3, DWH_KEY SIBLINGS 31)
)  KEY IS DWH_KEY
;

-- DIMENSION TABLE WAREHOUSE
CREATE TABLE WAREHOUSE
(
  DWH_KEY INTEGER NOT NULL,
  ADDRESS VARCHAR(35),
  STREET_NUMBER VARCHAR(13),
  POST_CODE VARCHAR(10),
  AREA_CODE VARCHAR(3) NOT NULL,
  CITY_CODE VARCHAR(3) NOT NULL,
  COUNTY_CODE VARCHAR(3) NOT NULL,
  GEODEPT_CODE VARCHAR(3) NOT NULL,
  COUNTRY_CODE VARCHAR(3) NOT NULL,
  AREA_NAME VARCHAR(40),
  AREA_POPULATION VARCHAR(30),
  CITY_NAME VARCHAR(30),
  CITY_POPULATION VARCHAR(30),
  COUNTY_NAME VARCHAR(30),
  COUNTY_POPULATION VARCHAR(30),
  GEODEPT_NAME VARCHAR(30),
  COUNTRY_NAME VARCHAR(30),
  COUNTRY_ABBREVIATION VARCHAR(4),
  COUNTRY_POPULATION VARCHAR(10),
  SURROGATE CS COMPOUND ( COUNTRY_CODE SIBLINGS 2,
    GEODEPT_CODE SIBLINGS 8, COUNTY_CODE SIBLINGS 8,
    CITY_CODE SIBLINGS 8, AREA_CODE SIBLINGS 32,
    DWH_KEY SIBLINGS 128)
)  KEY IS DWH_KEY
;

-- FACT TABLE
CREATE TABLE FACT WITH IKACCESS
(
  PRD_DWH_KEY INTEGER NOT NULL
    REFERENCES PRODUCT(DWH_KEY) ON UPDATE CASCADE,
  CAL_DWH_KEY INTEGER NOT NULL
    REFERENCES CALENDAR(DWH_KEY) ON UPDATE CASCADE,
  WHS_DWH_KEY INTEGER NOT NULL
    REFERENCES WAREHOUSE(DWH_KEY) ON UPDATE CASCADE,
  UNIT_COST NUMERIC(25,5),
  UNIT_SALE_PRICE NUMERIC(25,5),
  VAL_GROSS NUMERIC(25,5),
  VAL_TOTAL NUMERIC(25,5),
  SURROGATE CSP FOR PRD_DWH_KEY REFERENCES PRODUCT(CS),
  SURROGATE CSW FOR WHS_DWH_KEY REFERENCES WAREHOUSE(CS),
  SURROGATE CSD FOR CAL_DWH_KEY REFERENCES CALENDAR(CS),
  PRIMARY HCKEY (CSP, CSW, CSD)
)
;
      

The test schema has three dimensions PRODUCT, CALENDAR, WAREHOUSE, each containing a hierarchy, and a fact table with these three dimensions as clustering dimensions. The following table contains the cardinalities for the dimension tables and the fact table:

Table 7.1. Cardinalities

Table name     Cardinality
PRODUCT        27.929
CALENDAR       2.922
WAREHOUSE      226
FACT TABLE     8.538.143

7.4.1. Bad Clustering

Bad clustering means that the physical clustering does not match the "slow" queries: only a small fraction of the records stored on the pages read from disk belongs to the result set of the Hypercube access. In such a situation, the query is difficult to improve without changing the physical schema of the fact table. Bad clustering can happen when the query restricts only few of the dimensions used for physical clustering (PRIMARY HCKEY of the fact table), or when the overall number of clustering dimensions of the fact table is very high (more than 6).

The general recommendation is to reduce the number of dimensions (in particular, try to identify the important ones and remove the remaining ones from the PRIMARY HCKEY clause). Also, do not use dimensions with a very low cardinality (e.g., a dimension with only 5 distinct values in use).

You can recognize bad clustering by looking at the operator tree generated for the corresponding query. The following query is a standard query on the schema described above with hierarchical restrictions on three dimensions. Its semantics are: return the sum of the turnover for the year 1999 for one specific product group and one specific city, grouped by category and county:

SELECT
  CAL.YEAR_NUM, PRD.CATEGORY_LONG, W.COUNTY_CODE, SUM(VAL_GROSS)
FROM
  FACT F, PRODUCT PRD, CALENDAR CAL, WAREHOUSE W
WHERE
  F.CAL_DWH_KEY=CAL.DWH_KEY AND F.PRD_DWH_KEY=PRD.DWH_KEY AND
  F.WHS_DWH_KEY=W.DWH_KEY AND

  CAL.YEAR_NUM = '1999'  AND
  PRD.PRODUCTGROUP_LONG = 'RZBnwiayPm'
  AND W.CITY_CODE = '2'
GROUP BY
  CAL.YEAR_NUM, PRD.CATEGORY_LONG, W.COUNTY_CODE;
        

The operator tree of a query is generated and can be retrieved via the application interfaces. For example, in tbi you can look at the operator tree by setting PLANS to ON:

tbi> SET PLANS ON
tbi> x query.sql
tbi> e plan.txt
        

See the Transbase tbi manual for more information (and the programming interfaces for more information about retrieving the operator trees into application programs).

The following operator tree is the result of the query above:

(N0:sel { 4 "YEAR_NUM" "CATEGORY_LONG" "COUNTY_CODE" "column_3" } --#tup: 0
 (N1:times { proj 4(6 2 3 4) groupvaluejoin } --#tup: 2
  (N2:group { ghash [ 1 4 5 ] sum[2] repres[3 1] } --#tup: 2
   (N3:times { proj 5(1 2 4 5 6) groupexactjoin } --#tup: 196
   (N4:times { proj 5(1 4 5 6 7) groupexactjoin } --#tup: 196
    (N5:group { ghash [ 1 2 3 ] sum[4] repres[5 3] repres[6 -1] } --#tup: 196
     (N6:proj --#tup: 23015
      (N7:rel { "FACT" proj 6(1 2 3 6 9 10) } --#pages(index/leaf/nohit):19/467/21 --#tup: 23015
      (N8:times { keyaccess } --#tup: 2
       (N9:times { keyaccess } --#tup: 2
        (N10:ivmk { betw } --#tup: 1
         (N11:cs2ival { "PRODUCT" 8 } --#tup: 1
         (N12:sort { +1 , } --#tup: 422
          (N13:proj --#tup: 422
           (N14:restr --#tup: 422
            (N15:rel { "PRODUCT" proj 2(5 8) } --#tup: 27929 
            )
            (N16:eq { }
            (N17:attr { N14[1] } )
            (N18:const { 'RZBnwiayPm' char(10) } )))
           (N19:build
            (N20:attr { N13[2] } ))))))
        (N21:ivmk { betw } --#tup: 2
         (N22:cs2ival { "WAREHOUSE" 20 } --#tup: 2
         (N23:sort { +1 , } --#tup: 31
          (N24:proj --#tup: 31
           (N25:restr --#tup: 31
            (N26:index { "@@SYS_SURRHX_3" proj 3(4 7 8) user_tablename="WAREHOUSE" } --#tup: 226
            )
            (N27:eq { }
            (N28:attr { N25[1] } )
            (N29:const { '2' char(1) } )))
           (N30:build
            (N31:attr { N24[2] } )))))))
       (N32:ivmk { betw } --#tup: 1
        (N33:sort { +1 , } --#tup: 1
         (N34:proj --#tup: 1
         (N35:uniq { } --#tup: 1
          (N36:proj --#tup: 365
           (N37:index { "@@SYS_SURRHX_2" proj 2(6 7) user_tablename="CALENDAR" } --#tup: 365
            (N38:ivmk { eq } --#tup: 1
            (N39:const { '1999' char(4) } )))
           (N40:build
            (N41:subrg { false }
            (N42:attr { N36[1] } )
            (N43:const { 1 integer } )
            (N44:const { 3 integer } )))))
         (N45:build
          (N46:bitor
           (N47:attr { N34[1] } )
           (N48:const { 0b000000000000 bits2(12) } ))
          (N49:bitor
           (N50:attr { N34[1] } )
           (N51:const { 0b000111111111 bits2(12) } ))))))))
      (N52:build
      (N53:subrg { false }
       (N54:attr { N6[6] } )
       (N55:const { 1 integer } )
       (N56:const { 3 integer } ))
      (N57:attr { N6[1] } )
      (N58:subrg { false }
       (N59:attr { N6[5] } )
       (N60:const { 1 integer } )
       (N61:const { 7 integer } ))
      (N62:attr { N6[4] } )
      (N63:attr { N6[3] } )
      (N64:attr { N6[2] } ))))
    (N65:rel { "PRODUCT" proj 1(7) } --#tup: 196
     (N66:ivmk { eq } --#tup: 1
      (N67:attr { N4[2] } ))))
   (N68:rel { "WAREHOUSE" proj 1(7) } --#tup: 196
    (N69:ivmk { eq } --#tup: 1
     (N70:attr { N3[3] } )))))
  (N71:rel { "CALENDAR" proj 1(11) } --#tup: 2
   (N72:ivmk { eq } --#tup: 1
   (N73:attr { N1[5] } )))))
{nosort}
      

A description of the basic concepts of operator trees can be found in the Transbase performance manual. You have to understand the structure of these operator trees: an operator tree is processed from bottom to top, and each node has one or more children. The parentheses show how far a node reaches and thus represent the tree structure. In this manual, we only discuss the MHC-relevant parts of the operator tree. Since the nodes of operator trees are numbered (N0, N1, N2, ...), we use these numbers to reference the corresponding parts in our description.

The center of the star query processing is the fact table access. In the example above, the corresponding node is N7:

(N7:rel  { "FACT"  proj 6(1 2 3 6 9 10) }
        --#pages(index/leaf/nohit):19/467/21  --#tup: 23015
        

It can be recognized by the node type rel followed by the name of the fact table (here FACT) and statistics about the physical access to the underlying B*-Tree (Hypercube index):

--#pages(index/leaf/nohit):19/467/21  --#tup: 23015
        

Note that a rel node followed by the fact table name can also occur elsewhere in the operator tree (if the query contains self joins etc.), but the statistics usually identify the fact table access. These statistics show how good the clustering is. In our case, the clustering is excellent, because the number of leaf pages with hits, (467-21)=446, is very large compared to the number of nohit pages, 21. This means that less than 5% of all B*-Tree leaf pages read to evaluate the range queries did not contain hits. In total, 23015 records were in the result of the range query (defined by the restrictions on the dimensions). Thus, 23015/467=49,3 records per read leaf page (including the non-hitting pages) contribute to the query result. Compared to the average number of records per page of 106 (which can be determined with the tbcheck tool), this is a very good so-called clustering factor.

If there are statistics numbers like

--#pages(index/leaf/nohit):19/467/200  --#tup: 23015
        

the clustering is bad, because only 267 pages contain hits while 200 pages do not. We say that the clustering factor is very good if the number of non-hitting pages is below 10% of the overall number of read leaf pages. A clustering factor of 20% may be acceptable, and for less important queries even worse clustering factors may be tolerated. But if the clustering factor becomes too bad, you should consider reorganizing the fact table and modifying the clustering properties by defining different dimensions as clustering dimensions.

Note that the number of index page reads (in the example above 19) usually is quite low. A very high number indicates a large number of query boxes as described in the next section.

7.4.2. Many Query Boxes

A different performance problem is indicated by a very high CPU load with very few physical I/O operations. Certain predicates on dimensions within a star query may lead to bad query characteristics. For example, if a restriction on feature attributes of a dimension table results in a large number of intervals (and especially if more than one restricted dimension has such characteristics), the overall number of generated query boxes becomes very high (several tens of thousands or even more). The higher the number of query boxes, the more expensive it is to find hitting records. Many of these query boxes do not have hits and are therefore evaluated without retrieving any result records.

Such a situation can also be recognized by looking at the operator tree. It is indicated by the node following the fact table access (in the example below, node N8). This node should always be a times node; it defines the number of query boxes:

(N8:times  {   keyaccess }   --#tup: 785
        

In this case, there are 785 query boxes. This number results from the combination of feature predicates on the dimensions CALENDAR and WAREHOUSE as shown in the query below. The sample query from above is extended by a feature restriction on the weekday, which "cuts" holes into the interval covering the complete year 1999. The other restriction produces several small intervals instead of one large interval on the dimension WAREHOUSE with the predicate

area_code in ('10', '12', '22', '20', '1')
        

where the interval is split into smaller intervals on the next finer level.

SELECT
  CAL.YEAR_NUM, PRD.CATEGORY_LONG, W.COUNTY_CODE, SUM(VAL_GROSS)
FROM
  FACT F, PRODUCT PRD, CALENDAR CAL, WAREHOUSE W
WHERE
  F.CAL_DWH_KEY=CAL.DWH_KEY AND F.PRD_DWH_KEY=PRD.DWH_KEY AND
  F.WHS_DWH_KEY=W.DWH_KEY AND

  CAL.YEAR_NUM = '1999'  AND  DAY_OF_WEEK IN ('2','4','6') AND
  PRD.PRODUCTGROUP_LONG = 'RZBnwiayPm'
  AND W.CITY_CODE = '2' and
  AREA_CODE in ('10', '12', '22', '20', '1')
GROUP
  BY CAL.YEAR_NUM, PRD.CATEGORY_LONG, W.COUNTY_CODE
        

The high number of query boxes is the product of the number of intervals from the CALENDAR dimension (157) and the WAREHOUSE dimension (5): 157*5=785. The number of intervals from each dimension is indicated by ivmk nodes (interval make nodes) at specific positions in the operator tree. Note that such nodes occur at many positions in the operator tree, but only specific nodes indicate the number of intervals. Usually, for MHC star queries, the operator tree contains the dimension predicates in the lower and middle part of the tree, followed by a times cascade immediately below the fact table access. The times cascade combines the intervals of the single dimensions into the query boxes and indicates the overall number of intervals. The children of the times nodes are usually ivmk nodes. In the operator tree of this query (shown in full further below), the PRODUCT dimension returns one interval:

(N10:ivmk  {  betw   }   --#tup: 1
        

The WAREHOUSE dimension returns 5 intervals:

(N21:ivmk  {  betw   }   --#tup: 5
        

And the CALENDAR dimension returns 157 intervals:

(N40:ivmk  {  betw   }   --#tup: 157
        

Often the ivmk nodes are parents of cs2ival nodes, which are used to reduce the number of intervals per dimension (if possible). However, these nodes do not necessarily exist for hierarchical prefix predicates.

With these explanations in mind, here is the complete operator tree for the query with many query boxes:

(N0:sel { 4 "YEAR_NUM" "CATEGORY_LONG" "COUNTY_CODE" "column_3" } --#tup: 0
 (N1:times { proj 4(6 2 3 4) groupvaluejoin } --#tup: 2
  (N2:group { ghash [ 1 4 5 ] sum[2] repres[3 1] } --#tup: 2
   (N3:times { proj 5(1 2 4 5 6) groupexactjoin } --#tup: 172
   (N4:times { proj 5(1 4 5 6 7) groupexactjoin } --#tup: 172
    (N5:group { ghash [ 1 2 3 ] sum[4] repres[5 3] repres[6 -1] } --#tup: 172
     (N6:proj --#tup: 11962
      (N7:rel { "FACT" proj 6(1 2 3 6 9 10) } --#pages(index/leaf/nohit):1571/379/5 --#tup: 11962
      (N8:times { keyaccess } --#tup: 785
       (N9:times { keyaccess } --#tup: 5
        (N10:ivmk { betw } --#tup: 1
         (N11:cs2ival { "PRODUCT" 8 } --#tup: 1
         (N12:sort { +1 , } --#tup: 422
          (N13:proj --#tup: 422
           (N14:restr --#tup: 422
            (N15:rel { "PRODUCT" proj 2(5 8) } --#tup: 27929
            )
            (N16:eq { }
            (N17:attr { N14[1] } )
            (N18:const { 'RZBnwiayPm' char(10) } )))
           (N19:build
            (N20:attr { N13[2] } ))))))
        (N21:ivmk { betw } --#tup: 5
         (N22:cs2ival { "WAREHOUSE" 20 } --#tup: 5
         (N23:sort { +1 , } --#tup: 18
          (N24:proj --#tup: 18
           (N25:restr --#tup: 18
            (N26:index { "@@SYS_SURRHX_3" proj 4(4 5 7 8) user_tablename="WAREHOUSE" } --#tup: 226
            )
            (N27:and
            (N28:eq { }
             (N29:attr { N25[1] } )
             (N30:const { '2' char(1) } ))
            (N31:inls
             (N32:attr { N25[2] } )
             (N33:const { '10' char(2) } )
             (N34:const { '12' char(2) } )
             (N35:const { '22' char(2) } )
             (N36:const { '20' char(2) } )
             (N37:const { '1' char(1) } ))))
           (N38:build
            (N39:attr { N24[3] } )))))))
       (N40:ivmk { betw } --#tup: 157
        (N41:cs2ival { "CALENDAR" 12 } --#tup: 157
         (N42:sort { +1 , } --#tup: 157
         (N43:proj --#tup: 157
          (N44:restr --#tup: 157
           (N45:mat { "CALENDAR" } --#tup: 365
            (N46:index { "@@SYS_SURRHX_2" proj 1(7) user_tablename="CALENDAR" } --#tup: 365
            (N47:ivmk { eq } --#tup: 1
             (N48:const { '1999' char(4) } ))))
           (N49:inls
            (N50:attr { N44[3] } )
            (N51:const { '2' char(1) } )
            (N52:const { '4' char(1) } )
            (N53:const { '6' char(1) } )))
          (N54:build
           (N55:attr { N43[12] } ))))))))
      (N56:build
      (N57:subrg { false }
       (N58:attr { N6[6] } )
       (N59:const { 1 integer } )
       (N60:const { 3 integer } ))
      (N61:attr { N6[1] } )
      (N62:subrg { false }
       (N63:attr { N6[5] } )
       (N64:const { 1 integer } )
       (N65:const { 7 integer } ))
      (N66:attr { N6[4] } )
      (N67:attr { N6[3] } )
      (N68:attr { N6[2] } ))))
    (N69:rel { "PRODUCT" proj 1(7) } --#tup: 172
     (N70:ivmk { eq } --#tup: 1
      (N71:attr { N4[2] } ))))
   (N72:rel { "WAREHOUSE" proj 1(7) } --#tup: 172
    (N73:ivmk { eq } --#tup: 1
     (N74:attr { N3[3] } )))))
  (N75:rel { "CALENDAR" proj 1(11) } --#tup: 2
   (N76:ivmk { eq } --#tup: 1
   (N77:attr { N1[5] } )))))
{nosort}
        

There are several alternative improvements for such queries. One option is to look at the hierarchies and decide whether the corresponding feature attributes need better support. Then build a dedicated hierarchy and use this new hierarchy as the clustering hierarchy in the fact table (either remove the old hierarchy or use two hierarchies, with all the consequences).
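
As a hedged sketch only (the sibling counts, the NOT NULL constraint on DAY_OF_WEEK and the decision to replace the original hierarchy are assumptions), the CALENDAR dimension could be defined with DAY_OF_WEEK as an explicit hierarchy level, so that the weekday predicate of the example becomes a hierarchical restriction:

-- Alternative CALENDAR hierarchy with DAY_OF_WEEK as an explicit level;
-- attributes not involved in the hierarchy are omitted for brevity:
CREATE TABLE CALENDAR
(
  DWH_KEY INTEGER NOT NULL,
  CALENDAR_DATE DATETIME[YY:DD],
  DAY_OF_WEEK CHAR(1) NOT NULL,
  YEAR_NUM CHAR(4) NOT NULL,
  SURROGATE CS COMPOUND ( YEAR_NUM SIBLINGS 8,
    DAY_OF_WEEK SIBLINGS 7, DWH_KEY SIBLINGS 64 )
)  KEY IS DWH_KEY
;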

The other solution is to rewrite the query in the following way. Instead of executing the star query with this large number of query boxes, use one (or very few) intervals on the corresponding dimension by restricting it via suitable hierarchy levels (or by not restricting the dimension at all). In a second step, write an outer query block that joins the result of the inner star query with the dimension table and applies the final dimension predicate (the one with the large number of intervals resulting from the feature restrictions). The outer query block with the additional predicates is then joined with the result of the inner star query, and only a comparatively small number of records has to be joined. In such a situation the overall query performance is better than before. However, make sure that the rewritten query still delivers the correct result: the inner star query has to group by the dimension key of the corresponding dimension. Then the groups are correct and can be filtered by the join with the outer query block. In the following we show a simple example of this query rewriting.

The following query returns the same result as the query before but needs fewer query boxes (at the cost of one additional join):

SELECT
  YEAR_NUM, O_CATEGORY_LONG, W_COUNTY_CODE, SUM(SUM_VAL_GROSS)
FROM
(
  SELECT
    CAL.YEAR_NUM YEAR_NUM, PRD.CATEGORY_LONG AS O_CATEGORY_LONG,
    W.COUNTY_CODE AS W_COUNTY_CODE, SUM(VAL_GROSS) AS SUM_VAL_GROSS,
    W.DWH_KEY AS W_DWH_KEY
  FROM
    FACT F, PRODUCT PRD, CALENDAR CAL, WAREHOUSE W
  WHERE
    F.CAL_DWH_KEY=CAL.DWH_KEY AND F.PRD_DWH_KEY=PRD.DWH_KEY AND
    F.WHS_DWH_KEY=W.DWH_KEY AND

    CAL.YEAR_NUM = '1999'  AND  DAY_OF_WEEK IN ('2','4','6') AND
    PRD.PRODUCTGROUP_LONG = 'RZBnwiayPm'
    AND W.CITY_CODE = '2'
  GROUP BY
    CAL.YEAR_NUM, PRD.CATEGORY_LONG, W.COUNTY_CODE, W.DWH_KEY
),
WAREHOUSE W
WHERE
  W.DWH_KEY=W_DWH_KEY AND
  AREA_CODE IN ('10', '12', '22', '20', '1')
GROUP BY
  YEAR_NUM, O_CATEGORY_LONG, W_COUNTY_CODE
ORDER BY ALL
        

The operator tree below shows that the number of query boxes was reduced to 314 (from 785 before):

(N12:times  {   keyaccess }   --#tup: 314
        

Here is the complete operator tree:

(N0:sel { 4 "YEAR_NUM" "O_CATEGORY_LONG" "W_COUNTY_CODE" "column_3" } --#tup: 0
 (N1:sort { user_check_opti +1 +2 +3 +4 , presorted 4 } --#tup: 2
  (N2:group { [ 1 2 3 ] sum[5] } --#tup: 2
   (N3:sort { +1 +2 +3 , } --#tup: 3
   (N4:times { proj 5(1 2 3 4 5) } --#tup: 3
    (N5:times { proj 5(7 2 3 4 5) groupvaluejoin } --#tup: 5
     (N6:group { ghash [ 1 5 6 2 ] sum[3] repres[4 1] } --#tup: 5
      (N7:times { proj 6(1 2 3 5 6 7) groupexactjoin } --#tup: 326
      (N8:times { proj 6(1 4 5 6 7 8) groupexactjoin } --#tup: 326
       (N9:group { ghash [ 1 2 3 4 ] sum[5] repres[6 3] repres[7 -1] } --#tup: 326
        (N10:proj --#tup: 12742
         (N11:rel { "FACT" proj 6(1 2 3 6 9 10) } --#pages(index/leaf/nohit):629/379/3 --#tup: 12742
         (N12:times { keyaccess } --#tup: 314
          (N13:times { keyaccess } --#tup: 2
           (N14:ivmk { betw } --#tup: 1
            (N15:cs2ival { "PRODUCT" 8 } --#tup: 1
            (N16:sort { +1 , } --#tup: 422
             (N17:proj --#tup: 422
              (N18:restr --#tup: 422
               (N19:rel { "PRODUCT" proj 2(5 8) } --#tup: 27929
                )
               (N20:eq { }
               (N21:attr { N18[1] } )
               (N22:const { 'RZBnwiayPm' char(10) } )))
              (N23:build
               (N24:attr { N17[2] } ))))))
           (N25:ivmk { betw } --#tup: 2
            (N26:cs2ival { "WAREHOUSE" 20 } --#tup: 2
            (N27:sort { +1 , } --#tup: 31
             (N28:proj --#tup: 31
              (N29:restr --#tup: 31
               (N30:index { "@@SYS_SURRHX_3" proj 3(4 7 8) user_tablename="WAREHOUSE" } --#tup: 226
                )
               (N31:eq { }
               (N32:attr { N29[1] } )
               (N33:const { '2' char(1) } )))
              (N34:build
               (N35:attr { N28[2] } )))))))
          (N36:ivmk { betw } --#tup: 157
           (N37:cs2ival { "CALENDAR" 12 } --#tup: 157
            (N38:sort { +1 , } --#tup: 157
            (N39:proj --#tup: 157
             (N40:restr --#tup: 157
              (N41:mat { "CALENDAR" } --#tup: 365
               (N42:index { "@@SYS_SURRHX_2" proj 1(7) user_tablename="CALENDAR" } --#tup: 365
               (N43:ivmk { eq } --#tup: 1
                (N44:const { '1999' char(4) } ))))
              (N45:inls
               (N46:attr { N40[3] } )
               (N47:const { '2' char(1) } )
               (N48:const { '4' char(1) } )
               (N49:const { '6' char(1) } )))
             (N50:build
              (N51:attr { N39[12] } ))))))))
         (N52:build
         (N53:subrg { false }
          (N54:attr { N10[6] } )
          (N55:const { 1 integer } )
          (N56:const { 3 integer } ))
         (N57:attr { N10[1] } )
         (N58:subrg { false }
          (N59:attr { N10[5] } )
          (N60:const { 1 integer } )
          (N61:const { 7 integer } ))
         (N62:attr { N10[3] } )
         (N63:attr { N10[4] } )
         (N64:attr { N10[3] } )
         (N65:attr { N10[2] } ))))
       (N66:rel { "PRODUCT" proj 1(7) } --#tup: 326
        (N67:ivmk { eq } --#tup: 1
         (N68:attr { N8[2] } ))))
      (N69:rel { "WAREHOUSE" proj 1(7) } --#tup: 326
       (N70:ivmk { eq } --#tup: 1
        (N71:attr { N7[4] } )))))
     (N72:rel { "CALENDAR" proj 1(11) } --#tup: 5
      (N73:ivmk { eq } --#tup: 1
      (N74:attr { N5[6] } ))))
    (N75:restr --#tup: 3
     (N76:rel { "WAREHOUSE" proj 1(5) } --#tup: 5
      (N77:ivmk { eq } --#tup: 1
      (N78:attr { N4[4] } )))
     (N79:inls
      (N80:attr { N75[1] } )
      (N81:const { '10' char(2) } )
      (N82:const { '12' char(2) } )
      (N83:const { '22' char(2) } )
      (N84:const { '20' char(2) } )
      (N85:const { '1' char(1) } ))))))))
{nosort}
        

Of course, it is possible to further reduce the number of query boxes. To achieve this, treat the feature predicate on the CALENDAR dimension in the same way, as done in the following query:

SELECT
  C_YEAR_NUM, O_CATEGORY_LONG, W_COUNTY_CODE, SUM(SUM_VAL_GROSS)
FROM
(
  SELECT
    CAL.YEAR_NUM C_YEAR_NUM, PRD.CATEGORY_LONG AS O_CATEGORY_LONG,
    W.COUNTY_CODE AS W_COUNTY_CODE, SUM(VAL_GROSS) AS SUM_VAL_GROSS,
    W.DWH_KEY AS W_DWH_KEY, CAL.DWH_KEY AS C_DWH_KEY
  FROM
    FACT F, PRODUCT PRD, CALENDAR CAL, WAREHOUSE W
  WHERE
    F.CAL_DWH_KEY=CAL.DWH_KEY AND F.PRD_DWH_KEY=PRD.DWH_KEY AND
    F.WHS_DWH_KEY=W.DWH_KEY AND

    CAL.YEAR_NUM = '1999'  AND
    PRD.PRODUCTGROUP_LONG = 'RZBnwiayPm'
    AND W.CITY_CODE = '2'
  GROUP BY
    CAL.YEAR_NUM, PRD.CATEGORY_LONG, W.COUNTY_CODE,
    W.DWH_KEY, CAL.DWH_KEY
),
WAREHOUSE W, CALENDAR C
WHERE
  W.DWH_KEY=W_DWH_KEY AND C.DWH_KEY=C_DWH_KEY AND
  DAY_OF_WEEK IN ('2','4','6') AND
  AREA_CODE IN ('10', '12', '22', '20', '1')
GROUP BY
  C_YEAR_NUM, O_CATEGORY_LONG, W_COUNTY_CODE
ORDER BY ALL
        

This query reduced the number of query boxes to 2:

(N13:times  {   keyaccess }   --#tup: 2
        

but requires one additional join.

Here is the complete operator tree:

(N0:sel { 4 "C_YEAR_NUM" "O_CATEGORY_LONG" "W_COUNTY_CODE" "column_3" } --#tup: 0
 (N1:sort { user_check_opti +1 +2 +3 +4 , presorted 4 } --#tup: 2
  (N2:group { [ 1 2 3 ] sum[6] } --#tup: 2
   (N3:sort { +1 +2 +3 , } --#tup: 429
   (N4:times { proj 6(1 2 3 4 5 6) } --#tup: 429
    (N5:times { proj 6(1 2 3 4 5 6) } --#tup: 822
     (N6:times { proj 6(8 2 3 4 5 6) groupvaluejoin } --#tup: 1359
      (N7:group { ghash [ 1 6 7 2 3 ] sum[4] repres[5 1] } --#tup: 1359
      (N8:times { proj 7(1 2 3 4 6 7 8) groupexactjoin } --#tup: 10687
       (N9:times { proj 7(1 4 5 6 7 8 9) groupexactjoin } --#tup: 10687
        (N10:group { ghash [ 1 2 3 4 5 ] sum[6] repres[7 3] repres[8 -1] } --#tup: 10687
         (N11:proj --#tup: 23015
         (N12:rel { "FACT" proj 6(1 2 3 6 9 10) } --#pages(index/leaf/nohit):19/467/21 --#tup: 23015
          (N13:times { keyaccess } --#tup: 2
           (N14:times { keyaccess } --#tup: 2
            (N15:ivmk { betw } --#tup: 1
            (N16:cs2ival { "PRODUCT" 8 } --#tup: 1
             (N17:sort { +1 , } --#tup: 422
              (N18:proj --#tup: 422
               (N19:restr --#tup: 422
               (N20:rel { "PRODUCT" proj 2(5 8) } --#tup: 27929
               )
               (N21:eq { }
                (N22:attr { N19[1] } )
                (N23:const { 'RZBnwiayPm' char(10) } )))
               (N24:build
               (N25:attr { N18[2] } ))))))
            (N26:ivmk { betw } --#tup: 2
            (N27:cs2ival { "WAREHOUSE" 20 } --#tup: 2
             (N28:sort { +1 , } --#tup: 31
              (N29:proj --#tup: 31
               (N30:restr --#tup: 31
               (N31:index { "@@SYS_SURRHX_3" proj 3(4 7 8) user_tablename="WAREHOUSE" } --#tup: 226
               )
               (N32:eq { }
                (N33:attr { N30[1] } )
                (N34:const { '2' char(1) } )))
               (N35:build
               (N36:attr { N29[2] } )))))))
           (N37:ivmk { betw } --#tup: 1
            (N38:sort { +1 , } --#tup: 1
            (N39:proj --#tup: 1
             (N40:uniq { } --#tup: 1
              (N41:proj --#tup: 365
               (N42:index { "@@SYS_SURRHX_2" proj 2(6 7) user_tablename="CALENDAR" } --#tup: 365
               (N43:ivmk { eq } --#tup: 1
                (N44:const { '1999' char(4) } )))
               (N45:build
               (N46:subrg { false }
                (N47:attr { N41[1] } )
                (N48:const { 1 integer } )
                (N49:const { 3 integer } )))))
             (N50:build
              (N51:bitor
               (N52:attr { N39[1] } )
               (N53:const { 0b000000000000 bits2(12) } ))
              (N54:bitor
               (N55:attr { N39[1] } )
               (N56:const { 0b000111111111 bits2(12) } ))))))))
         (N57:build
          (N58:subrg { false }
           (N59:attr { N11[6] } )
           (N60:const { 1 integer } )
           (N61:const { 3 integer } ))
          (N62:attr { N11[1] } )
          (N63:subrg { false }
           (N64:attr { N11[5] } )
           (N65:const { 1 integer } )
           (N66:const { 7 integer } ))
          (N67:attr { N11[3] } )
          (N68:attr { N11[2] } )
          (N69:attr { N11[4] } )
          (N70:attr { N11[3] } )
          (N71:attr { N11[2] } ))))
        (N72:rel { "PRODUCT" proj 1(7) } --#tup: 10687
         (N73:ivmk { eq } --#tup: 1
         (N74:attr { N9[2] } ))))
       (N75:rel { "WAREHOUSE" proj 1(7) } --#tup: 10687
        (N76:ivmk { eq } --#tup: 1
         (N77:attr { N8[5] } )))))
      (N78:rel { "CALENDAR" proj 1(11) } --#tup: 1359
      (N79:ivmk { eq } --#tup: 1
       (N80:attr { N6[7] } ))))
     (N81:restr --#tup: 822
      (N82:rel { "WAREHOUSE" proj 1(5) } --#tup: 1359
      (N83:ivmk { eq } --#tup: 1
       (N84:attr { N5[4] } )))
      (N85:inls
      (N86:attr { N81[1] } )
      (N87:const { '10' char(2) } )
      (N88:const { '12' char(2) } )
      (N89:const { '22' char(2) } )
      (N90:const { '20' char(2) } )
      (N91:const { '1' char(1) } ))))
    (N92:restr --#tup: 429
     (N93:rel { "CALENDAR" proj 1(3) } --#tup: 822
      (N94:ivmk { eq } --#tup: 1
      (N95:attr { N4[5] } )))
     (N96:inls
      (N97:attr { N92[1] } )
      (N98:const { '2' char(1) } )
      (N99:const { '4' char(1) } )
      (N100:const { '6' char(1) } ))))))))
{nosort}
    

It is possible to reduce the number of query boxes for every critical query. In practical cases, there are often several tens of thousands of query boxes; reducing them may cut the overall query execution time to as little as 30% of the original time. The effect always depends on the properties of the query, its complexity, the overall number of records read from the fact table, the number of groups resulting from the first and second grouping, etc.

7.4.3. Badly optimized Queries

Sometimes queries run slowly because they are not recognized as star queries. The optimizer needs specific schemata and query patterns to recognize that a query is a star query and can be executed using the special MHC techniques (Hypercube query box access, pre-grouping, etc.). Whether the optimizer recognized the query correctly can only be decided by examining the operator tree. See Section Bad Clustering for how to retrieve the operator tree.

There are some patterns that occur in an operator tree optimized with MHC techniques. The easiest check is to look for the following entry:

...
   (N7:rel  { "FACT"  proj 6(1 2 3 6 9 10) }
        --#pages(index/leaf/nohit):3040/467/33  --#tup: 8423
...
        

The name of the table (in this case "FACT") must be the name of the fact table, and there must be statistics about the access to the B*-Tree pages: "#pages(index/leaf/nohit):". This indicates that a Hypercube access was executed (see also Section Bad Clustering for more information on how to interpret these statistics).

Another indication for a correct MHC optimization is the use of the subrg node:

...
  (N61:subrg  { false }
    (N62:attr  { N6[6] } )
    (N63:const  { 1  integer } )
    (N64:const  { 3  integer } ))
...
        

This is used to find the corresponding prefix path within the compound surrogates, in order to evaluate the dimension predicates and find the correct intervals for the Hypercube access.

In many cases, the following pattern (a cs2ival node) appears in a MHC optimized query:

...
  (N30:cs2ival  { "WAREHOUSE" 20 }  --#tup: 2
    (N31:sort  {  +1 , }   --#tup: 31
      (N32:proj   --#tup: 31
        (N33:restr   --#tup: 31
...
        

In this case, an interval optimization algorithm is used to reduce the number of intervals for the dimension WAREHOUSE as much as possible.

There are many other indications of correctly optimized MHC queries, but with these three checks you can determine with high confidence that the query is optimized correctly. If you find that the query is not optimized properly, check whether the query is formulated as a correct star query (joins, predicates, etc.). If it is still too slow after rewriting it, check for proper clustering and a reasonable number of query boxes as explained before.

7.5. Tool support

Since the DDL statements for MHC schemata are very specific to Transbase , there are no commercial or free tools that fully support Transbase MHC schema design. Transaction Software GmbH implements the so-called Data Model Editor (DME), which can also be used to browse a MHC schema. In this section we show some of its functionality, in particular some screenshots of how to analyze a given MHC schema. Note that the current DME version is a very early one and still requires a lot of implementation work.

Figure 7.1. Basic Warehouse View

Basic Warehouse View

Figure: Basic Warehouse View shows the overview of the warehouse in the database tbhcdemo@localhost. The warehouse can be browsed in parallel to the tables, views, etc. When double-clicking on the warehouse entry in the tree view, all information about the warehouse is retrieved from the database. If the warehouse is complex (with several star schemata), this can take some time. The information is cached in main memory, so further access to the warehouse is very fast.

The warehouse view lists all star schemata as recognized in the database. The prerequisite for recognizing a star schema is the correct implementation of a MHC schema in a Transbase Hypercube database. In Figure: DME Warehouse there is one star schema with the fact table "FACT" and the three dimensions CALENDAR, PRODUCT, and WAREHOUSE. All of the dimensions are clustering dimensions (indicated by "(C)" behind the dimension name). Dimensions that do not contribute to the physical clustering but are also referenced by the fact table are denoted by "(N)" and are marked in a different color. Note that the names of the fact table and dimensions are the same as the table names.

From this view it is possible to browse the fact table and the dimensions.

Figure 7.2. Fact Table View

Fact Table View

The properties of the fact table are shown when double-clicking on the fact table. All table attributes (but not the reference surrogates) are shown as child nodes in the tree view with the corresponding data type definitions. Further, the standard detail table pane is opened, providing all information about the fact table (attributes with data type, NULL definitions, default values, ranges and scales, etc.). Additional information such as constraints and indexes is also given; see the documentation of the Data Model Editor for more details. In the detail table pane the reference surrogates are also shown, but they are not marked as surrogates; they are presented as standard table attributes with the data type BITS2. You can also look at the CREATE TABLE statement by pressing the button "Show DDL" (see Figure: Reference Surrogates of Fact Table ).

A special feature is the tab pane for the surrogates. The details of the reference surrogates for the fact table are shown in Figure: Reference Surrogates of Fact Table . The name, the type of the surrogate and the surrogate definition are shown in this panel.

Figure 7.3. Reference Surrogates of Fact Table

Reference Surrogates of Fact Table

When browsing the dimensions, the dimension table itself and the hierarchies defined on the dimension via compound surrogate clauses can be viewed. Figure: Dimension Calendar shows an example for the dimension CALENDAR. When double-clicking on the hierarchy entry (in our example, the dimension has only one hierarchy), the hierarchy is shown from top to bottom (YEAR_NUM - HALF_YEAR_YEAR - QUARTER_YEAR - MONTH_YEAR - DWH_KEY) with the maximum number of siblings, as defined in the surrogate clause, in parentheses.

Figure 7.4. Dimension Calendar

Dimension Calendar

When clicking on the dimension table (CALENDAR in Figure: Dimension Table Calendar ), the detail table pane is shown for the physical dimension table with all information about attributes, constraints, indexes, etc. (the same as for the fact table). The children of the dimension table entry in the tree view on the left again show the attributes of the dimension table with the corresponding data types. However, the compound surrogates are not shown here; they occur in the detail table pane as attributes with data type BITS2 and in the surrogate tab panel as shown in Figure: Compound Surrogate Dimension Calendar . In this figure, the compound surrogate definition is shown in the detail view.

The detail table panels are also available when clicking on a table in the standard tables view of the database. If the table has surrogates, they are shown in detail. Again, the CREATE TABLE statement is shown by pressing "Show DDL".

Figure 7.5. Dimension Table Calendar

Dimension Table Calendar

Figure 7.6. Compound Surrogate Dimension Calendar

Compound Surrogate Dimension Calendar

The Data Model Editor is quite useful to get a quick view of an (even complex) data warehouse schema in a Transbase Hypercube database. It is still in development and will in the future contain tools to further analyze and build a data warehouse system, analyze queries, and give optimization and tuning hints. It is considered the strategic tool for handling data warehouse schemata on Transbase Hypercube databases.

7.6. Database Parameters for Data Warehouses

The characteristics of a data warehouse database usually differ from those of other database applications. In data warehouses, the amount of data is very high, the important queries are SELECT queries, and specific query patterns must be supported efficiently. This also influences the optimal database configuration in terms of database parameters. Of course, it is impossible to recommend the optimal database parameters within this manual, since each database requires specific tuning, but some common statements and hints can be given.

Tuning with database parameters requires a good understanding of database concepts, of Transbase Hypercube, and of the concept of MHC. The most important parameters are

  • page size

  • logging mode

  • caches

In the following we discuss each of them separately.

7.6.1. Page Size

The page size is the finest granularity used for physical I/O operations by the database system. For example, if the page size is 4KB, then a group of disk blocks totaling 4KB is read with one read operation or written with one write operation. The page size is also the size of the B*-Tree pages. In particular, the page size must be at least as large as the largest record that must be stored in the database (except BLOBs). Note that the page size must be set when creating the database and cannot be changed later.

Under no circumstances should the page size be set smaller than the disk's physical block size. This block size depends on the platform and the operating system (or format options). The Transbase I/O access pattern is random for reads and writes in the disk files (e.g., index and table pages) and sequential for writing into the log files. Thus, when using a RAID system, it can be configured in the following way:

  • RAID 5: disk files

  • RAID 0 or 1 or 10: log files and bfims directory

In data warehouse applications, the fact table records are usually very large, because there is often a large number of measure attributes. Therefore the page size should be comparatively large, too. To exploit the clustering and access characteristics of the Hypercube index, the page size should also be large enough to store several records within one page. Most data warehouses in Transbase Hypercube are implemented with page sizes of 8KB or 16KB.
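As a rough, purely illustrative estimate of this effect, the following sketch computes how many records would fit on a page; the 700-byte average record size and the 100-byte per-page overhead are assumptions for the example, not Transbase-internal values:

# Rough estimate of how many fact-table records fit on one page.
# The record size and per-page overhead below are illustrative
# assumptions, not Transbase-internal values.

def records_per_page(page_size_bytes: int,
                     avg_record_bytes: int,
                     page_overhead_bytes: int = 100) -> int:
    """Estimate the average number of records per B*-Tree page."""
    usable = page_size_bytes - page_overhead_bytes
    return max(usable // avg_record_bytes, 0)

if __name__ == "__main__":
    avg_record = 700  # assumed average fact record size in bytes
    for page_kb in (4, 8, 16):
        n = records_per_page(page_kb * 1024, avg_record)
        print(f"{page_kb:>2} KB page: ~{n} records per page")

With these assumptions an 8KB page holds roughly 11 records and a 16KB page roughly 23, which matches the recommendation to store several records per page; the actual figures should always be measured, as described next.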

You can test the number of records stored within one page (on average) using the tbcheck tool. See the System Guide for more information about tbcheck.

An overly large page size, however, affects the load performance if BFIM logging is used (see Section Logging Mode for more information on this topic).

7.6.2. Logging Mode

There are two different logging variants in Transbase:

  • Before Image (BFIM) Logging

  • Delta Logging

These two concepts are well-known database technologies to implement transaction recovery (and for delta logging also general recovery) mechanisms. We only give a very short overview of these techniques, because they are described in the System Guide and in many database management system implementation books and articles.

BFIM logging writes each page that is about to be modified into a so-called bfim area (before modification). The changes are made in the disk file and become permanent when the transaction is committed. When the transaction is rolled back (in Transbase: abort), the modified pages in the disk file are replaced by the before-image copies from the bfim file. This logging method is recommended for long write transactions such as the initial load. For the initial load it is very suitable, because the tables to be filled are usually empty and no before-image pages must be written into the bfim file. BFIM logging does not allow incremental backups.

Delta logging is a kind of logical logging, where the database modifications are written consecutively into a log file. This log is synced at the end of the transaction (commit/rollback), whereas the disk file is synced very seldom. This method speeds up write transactions significantly and is recommended for relatively short transactions (especially for OLTP applications), but also for differential loads of data. This logging method is required for incremental backups and for database recovery to a particular time stamp (savepoint).
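To illustrate the two concepts, here is a minimal toy model; it sketches only the general before-image and log-append ideas and is in no way the Transbase implementation:

# Conceptual toy model of the two logging variants; it illustrates the
# ideas only and is not the Transbase implementation.

class BfimStore:
    """Before-image logging: copy a page to the bfim area before modifying it."""

    def __init__(self):
        self.pages = {}   # page_no -> bytes (the "disk file")
        self.bfim = {}    # page_no -> bytes (before images of the current transaction)

    def write_page(self, page_no, data):
        if page_no in self.pages and page_no not in self.bfim:
            self.bfim[page_no] = self.pages[page_no]  # save before image once
        self.pages[page_no] = data                    # modify the disk file in place

    def commit(self):
        self.bfim.clear()                             # in-place changes become permanent

    def rollback(self):
        self.pages.update(self.bfim)                  # restore before images
        self.bfim.clear()


class DeltaStore:
    """Delta logging: append change records to a sequential log."""

    def __init__(self):
        self.pages = {}
        self.log = []     # sequential log of (page_no, data) change records

    def write_page(self, page_no, data):
        self.log.append((page_no, data))              # sequential write; disk file untouched here

    def commit(self):
        for page_no, data in self.log:                # in reality: sync the log at commit,
            self.pages[page_no] = data                # apply changes to the disk file lazily
        self.log.clear()

    def rollback(self):
        self.log.clear()                              # nothing was applied to the disk file


if __name__ == "__main__":
    b = BfimStore()
    b.write_page(1, b"old")
    b.commit()
    b.write_page(1, b"new")
    b.rollback()
    assert b.pages[1] == b"old"                       # before image restored on abort

    d = DeltaStore()
    d.write_page(1, b"new")
    d.commit()
    assert d.pages[1] == b"new"                       # log applied at commit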

Since both techniques write data to the disk files and the log files almost in parallel, you should use two different physical disks for the disk files and the log files. In this case, the overall database performance for write transactions increases significantly. Read transactions, however, are not affected. For more information about how to specify different directories for log files and disk files, please refer to the Transbase System Guide.

7.6.3. Database Caches

The database caches are very important for read and write performance. Usually a complete system has a number of caches at each level: the hard disk controller has a disk cache, the operating system has a file system cache, the database management system has its own main memory caches, the CPU has third-level, second-level, first-level, and register caches, etc. Caches keep data that will be accessed again in faster memory structures. Thus, a reasonable configuration of these caches is mandatory for good performance of the complete system.

Transbase provides two kinds of caches:

  • The Data Cache is a database-global cache used by Transbase. Thus, if one service thread has read a specific data page that is needed later by another service thread, that thread finds the page in the data cache and does not have to access the hard disk. Since the configurations of today's systems vary widely (especially for data warehouse architectures), it is difficult to give general recommendations for the data cache size. For systems with large main memory (several GB), we recommend using no more than 50% of the main memory for the data cache. The file system cache of the operating system often also supports the DBMS and, in modern operating systems, uses all free memory.

  • The Sorter Cache is a local cache defined for each worker thread. The sorter cache size is the maximum amount of memory that such a thread may allocate for operations like sorting, hashing, etc. Therefore, the maximum memory consumption resulting from the sorter cache parameter is the size of the sorter cache multiplied by the number of threads running in parallel, as sketched below. Note that this maximum is reached very rarely, because the sizes are upper bounds and most queries and operations use much less.
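A back-of-the-envelope sketch of these two guidelines; the 16 GB of main memory, 50,000 KB sorter cache, and 8 parallel worker threads are assumed example values:

# Back-of-the-envelope bounds derived from the cache guidelines above.
# The main-memory size, sorter cache size and thread count are assumptions.

def max_data_cache_bytes(main_memory_bytes: int) -> int:
    """Upper bound for the data cache: no more than 50% of main memory."""
    return main_memory_bytes // 2

def max_sorter_memory_bytes(sorter_cache_bytes: int, parallel_threads: int) -> int:
    """Worst-case total sorter memory: per-thread cache times parallel threads."""
    return sorter_cache_bytes * parallel_threads

if __name__ == "__main__":
    ram = 16 * 1024**3            # assumed 16 GB of main memory
    sorter = 50_000 * 1024        # assumed 50,000 KB sorter cache
    threads = 8                   # assumed number of parallel worker threads
    print("data cache   <=", max_data_cache_bytes(ram) // 1024**2, "MB")
    print("sorter total <=", max_sorter_memory_bytes(sorter, threads) // 1024**2, "MB")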

For data warehouse applications, the size of the sorter cache is critical for mass load operations of the fact table, because the special mass load optimization for the reference surrogate lookup is governed by the size of the sorter cache. Since for each dimension table of the corresponding fact table the records (dimID, compSurr) are stored in a hash table, the memory consumption depends on the size (and number) of the dimension tables. The sorter cache should be at least twice the sum of the memory consumed by these hash tables.

Here is an example for the calculation:

sorter cache size >= 2 * ( |(dimKey1, cs1)| * |D1| + ... + |(dimKeyn, csn)| * |Dn| )
        

E.g., for 4 dimensions with Type(dimKey) = INTEGER for all dimensions and

|D1| = 100, |D2| = 500, |D3| = 1000, |D4| = 1000:
|(INTEGER, cs)| = 4 + 6 = 10 bytes
LC = 10*100 + 10*500 + 10*1000 + 10*1000 bytes = 26,000 bytes = 26 KB
recommended sorter cache >= 2 * 26 KB = 52 KB
        

Please consider that dimension tables usually grow during the lifetime of the data warehouse. Typical sizes for the sorter cache are 10,000 to 50,000 KB. See the Transbase System Guide for how to set the sorter and data caches. For query execution, the size of the sorter cache is not very critical.
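The sizing rule can also be expressed as a small helper function; the key sizes, cardinalities, and the 6-byte compound surrogate size simply reproduce the example above and should be replaced with the values of your own schema:

# Sorter cache sizing for the reference surrogate lookup during mass load.
# Key sizes, cardinalities and the 6-byte compound surrogate size reproduce
# the example above; substitute the values of your own dimension tables.

def sorter_cache_recommendation(dimensions, surrogate_bytes=6):
    """dimensions: list of (key_size_bytes, cardinality) per dimension table.

    Returns the recommended sorter cache size in bytes: twice the memory
    needed to hash the (dimKey, compSurr) records of all dimension tables.
    """
    hash_bytes = sum((key + surrogate_bytes) * card for key, card in dimensions)
    return 2 * hash_bytes

if __name__ == "__main__":
    INTEGER = 4  # byte size of an INTEGER dimension key
    dims = [(INTEGER, 100), (INTEGER, 500), (INTEGER, 1000), (INTEGER, 1000)]
    rec = sorter_cache_recommendation(dims)
    print(rec, "bytes =", rec // 1000, "KB")   # 52,000 bytes = 52 KB

The result, 52,000 bytes = 52 KB, is twice the 26 KB computed in the example above.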

8. Conclusion

Summarizing the previous chapters and sections, Transbase Hypercube is a relational database management system that provides multidimensional clustering via the Hypercube index. In combination with the MHC technology, Transbase Hypercube is especially suitable as a database for data warehouse applications.

Special mechanisms are implemented to support star and snowflake schemata, efficient processing of star queries, mass loading of data, and parallel user access. Note that your own applications must be designed carefully to make use of these mechanisms.

The most important step when building a new data warehouse is the analysis of the user queries, in order to build a sound and optimized database schema. Bad schema design can cause many problems in terms of functionality, performance, scalability, and future maintenance of the data warehouse. Regular reviews of the data warehouse and the database can detect problems early and give you time to solve them before users become aware of them.

Appendix A. References

Here are the references to papers that are mentioned in the manual:

[Bay97]

R. Bayer: The universal B-Tree for multi-dimensional Indexing: General Concepts. WWCA '97. Tsukuba, Japan, LNCS, Springer Verlag. March 1997.

[Kar02]

N. Karayannidis, A. Tsois, T. Sellis, R. Pieringer, V. Markl, F. Ramsak, R. Fenk, K. Elhardt, R. Bayer: Processing Star Queries on Hierarchically-Clustered Fact Tables. VLDB 2002.

[Kim96]

R. Kimball: The Data Warehouse Toolkit. John Wiley & Sons, New York. 1996.

[Pie03]

R. Pieringer, K. Elhardt, F. Ramsak, V. Markl, R. Fenk, R. Bayer, N. Karayannidis, A. Tsois, T. Sellis: Combining Hierarchy Encoding and Pre-Grouping: Intelligent Grouping in Star Queries. ICDE 2003.

[Ram00]

F. Ramsak, V. Markl, R. Fenk, M. Zirkel, K. Elhardt, R. Bayer: Integrating the UB-Tree into a Database System Kernel. Proceedings of the International Conference on Very Large Data Bases (VLDB 2000), Cairo, Egypt, 2000.