Tuesday, January 21, 2020

Loading facts in ETL


LOADING FACTS
The most important thing about loading fact tables is that first you need to load dimension tables and then according to the specification the fact tables.


The fact table is often located in the center of a star schema, surrounded by dimension tables. It has two types of columns: those containing facts and other containing foreign keys to dimension tables.



FACT TABLE CONSISTS OF:
Measurements: additive-measures that can be added across all dimensions , non-additive cannot be added and semi-additive can be added across few dimensions.

METRICS
Facts - The fact table has frequently more than one measurement field and then each field is called a fact. Usually one fact table has at least three dimension tables.
Fact table granulatity
First point is creating the grain of a table.

GRAIN

  • The grain is simply a definition of a measurement event and allows dimension tables to fallow appropriate fact table. The connection between both appears through the foreign keys in the fact table and provide the context of the measurements. Every fact table should have only one grain(two or more are not recommended).


TYPES OF FACT TABLE
Transaction grain. Use only if the measurement events take place. It is for instantaneous actions. Those tables are the largest of all three types.
Periodic snapshot. Use for long tracking processes like bank account. It is a measurement at a specific point of time , regularly repeated and it refers to the period of time in the dimension table. The most common is monthly snapshot.
Accumulating snapshot. Use for processes that have defined beginning and end. It shows a status in every moment we need.
Preserving integrity
To preserve the integrity all foreign keys references should be correct and known. However there are two situations when this integrity may be broken. It is when we load fact tables with incorrect foreign key or when we delete a dimension with a related primary key. Then it is impossible to make a connection between those tables and we may loose some very important relationships.

Any query that references this relationship will fail. To avoid this problem you should analyze the loading phase in every step from preparation , through loading and after loading by scanning the fact table and looking for bad foreign keys.



Next point is checking the surrogate keys. When we load fact table we assume that all records are current. This operation refers to a surrogate key pipeline where the source data for the fact table contains natural keys that must be replaced by surrogate keys before the records are loaded. They are found through the line of little lookup surrogate tables which are updated every time a new dimension entity is created or changed.

Only with appropriate current surrogate key we may connect to a fact table with good results. However the lookup surrogate tables are not effective if we are handling with late arriving records. Then the dimension is a source of the surrogate key.

In the beginning of this article I have mentioned about the primary and surrogate key of the dimension table. Now we will look closer at those two elements considering two cases: the current database type and the temporary database type.

A current database stores only current data and the relationship between the surrogate and primary key is one to one so the surrogate key may be used as a primary key.

In the temporary database this relationship is many to one and then we cannot make this replacement.

The database should reflect the data of its source to provide appropriate business activity. In other words all data should be current and correct.

HANDLING INCORRECT DATA

Negate the fact (do nothing)
Update the fact (overwrite)
Delete and remove the fact - the most common way. We may delete facts physically and logically. When deleted physically we just delete permanently unwanted record while in logically deleting we create a delete column for unwanted records. The first one is easier but the second is much safer and most widely used.
Factless fact tables
Now we will say shortly about some other points of loading fact tables.

The fact less fact table is a table that contains only dimensional keys without any measured values.

TYPES OF FACT TABLE:

First type is used for capturing the event for information not calculation purposes. It is just an information about the even that happened.
Second used for support negative analysis report and guarantee coverage.
Multiple measures units
Sometimes the number of measurements is so big that presenting them may be quite inconvenient. To resolve this problem we use special units of measure in a fact table.

The currencies
In many databases which influence business decisions there is a need to express facts in many currencies. Then we create a foreign key in the fact table(like a sale fact table) which connect to the currency dimension table.

Late arriving facts
We locate those late arriving facts in historical position but we have to be very careful in choosing old contemporary dimension records. To make this operation appropriately you should follow these steps: First - find the corresponding dimension record, then replace the natural keys of late arriving facts with the surrogate key founded in dimension records and locate this late arriving fact into a correct place in the database where other fact tables from that time are stored.

No comments:

Post a Comment