Teradata Material: DATAWAREHOUSE

Operational DataStore(ODS)

An ODS stands for -Operational DataStore.

ODS is a database which is designed to integrate data from multiple sources for additional operations on the data like - cleaning, resolving redundancy and checking against business rules for integrity etc.

Operational data store is a real time integrated data store in the organization to manage the business process and their results.

Data Warehouse is used for strategic decision making whereas ODS is used to day to day decision making.

Fact tables

Fact -A fact table typically has two types of columns, foreign keys to dimension tables and measures those that contain numeric facts. A fact table can contain fact’s data on detail or aggregated level.

Types of Facts -

Additive:

Additive facts are facts that can be summed up through all of the dimensions in the fact table. A sales fact is a good example for additive fact.

Semi-Additive:

Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Eg: Daily balances fact can be summed up through the customers dimension but not through the time dimension.

Non-Additive:

Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Eg: Facts which have percentages, ratios calculated.

Factless Fact Table:

In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called “Factless Fact tables”.

Eg: A fact table which has only product key and date key is a factless fact. There are no measures in this table. But still you can get the number products sold over a period of time.

Based on the above classifications, fact tables are categorized into two:

Cumulative:

This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact tables are mostly additive facts. The first example presented here is a cumulative fact table.

Snapshot:

This type of fact table describes the state of things in a particular instance of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

Dimension tables in datawarehouse

Dimension -

A dimension table typically has two types of columns, primary keys to fact tables and textual\descreptive data.

Types of Dimensions -

Slowly Changing Dimensions:

Attributes of a dimension that would undergo changes over time. It depends on the business requirement whether particular attribute history of changes should be preserved in the data warehouse. This is called a Slowly Changing Attribute and a dimension containing such an attribute is called a Slowly Changing Dimension.

Rapidly Changing Dimensions:

A dimension attribute that changes frequently is a Rapidly Changing Attribute. If you don’t need to track the changes, the Rapidly Changing Attribute is no problem, but if you do need to track the changes, using a standard Slowly Changing Dimension technique can result in a huge inflation of the size of the dimension. One solution is to move the attribute to its own dimension, with a separate foreign key in the fact table. This new dimension is called a Rapidly Changing Dimension.

Junk Dimensions:

A junk dimension is a single table with a combination of different and unrelated attributes to avoid having a large number of foreign keys in the fact table. Junk dimensions are often created to manage the foreign keys created by Rapidly Changing Dimensions.

Inferred Dimensions:

While loading fact records, a dimension record may not yet be ready. One solution is to generate an surrogate key with Null for all the other attributes. This should technically be called an inferred member, but is often called an inferred dimension.

Conformed Dimensions:

A Dimension that is used in multiple locations is called a conformed dimension. A conformed dimension may be used with multiple fact tables in a single database, or across multiple data marts or data warehouses.

Degenerate Dimensions:

A degenerate dimension is when the dimension attribute is stored as part of fact table, and not in a separate dimension table. These are essentially dimension keys for which there are no other attributes. In a data warehouse, these are often used as the result of a drill through query to analyze the source of an aggregated number in a report. You can use these values to trace back to transactions in the OLTP system.

Role Playing Dimensions:

A role-playing dimension is one where the same dimension key — along with its associated attributes — can be joined to more than one foreign key in the fact table. For example, a fact table may include foreign keys for both Ship Date and Delivery Date. But the same date dimension attributes apply to each foreign key, so you can join the same dimension table to both foreign keys. Here the date dimension is taking multiple roles to map ship date as well as delivery date, and hence the name of Role Playing dimension.

Shrunken Dimensions:

A shrunken dimension is a subset of another dimension. For example, the Orders fact table may include a foreign key for Product, but the Target fact table may include a foreign key only for ProductCategory, which is in the Product table, but much less granular. Creating a smaller dimension table, with ProductCategory as its primary key, is one way of dealing with this situation of heterogeneous grain. If the Product dimension is snowflaked, there is probably already a separate table for ProductCategory, which can serve as the Shrunken Dimension.

Static Dimensions:

Static dimensions are not extracted from the original data source, but are created within the context of the data warehouse. A static dimension can be loaded manually — for example with Status codes — or it can be generated by a procedure, such as a Date or Time dimension.

What is a staging area? Why we need it? What is the purpose of a staging area?

Staging area is place where we the hold temporary tables on data warehouse server.

Staging tables are connected to work area or fact tables.

We basically need staging area to hold the data for cleansing(removing spaces/nulls or to provide expected date format etc.)

and merging purpose, before loading the data into data warehouse/data mart.

What is Datamart ? why we need it? Types of Datamarts

Definition of datamart:–

Datamart is a subset of a Data Warehouse that supports the requirements of a particular department or business function.
Data Mart is specialized single-line of business warehouses. Data Mart collects data for a single department or a specific group of people.

Below are the reasons to create Data Marts :-

1. Provide users access to data the need to analyze most often
2. Provide data in a common format for a group of users
3. Improve end user response time
4. Provide appropriately structured data based on the requirements of end user access
tools
5.Less complex implementation and setup
6.Reduced cost
7.Easier to support users

Types of Datamarts

There are three types of Datamarts mentioned below :-

1. Multidimensional Datamarts
2. Relational Datamarts
3. Networked Datamarts

Multidimensional Datamarts:-

Contain only numeric data
Maintain structure as data enters the framework
Only changed cells are updated and not the others
Contain sparsely populated matrices

Relational Data Marts :-

Contains numeric and textual both the data
Support star schemas/snow flake schema
Maintain structured data
Employ indexes while creating datamarts

Networked Data Marts:-

Data needs to be cleansed and standardized
Distributed data marts can be accessed from a single workstation
A standard metadata directory must exist in it

Data Mining

Data mining is a process of extracting hidden trends within a datawarehouse.

Example –An insurance datawarehouse can be used to mine data to identify the most high risk buildings to insure in a certain geographical area .

Data Mining also involves analyzing moderate to large amounts of detailed historical data to detect behavioral patterns that are then used to predict future behavior.

Using Data Mining, organization can able to find the answer of the questions like-Which customers are most likely to leave?

Advantages of Data Ware Housing

Below are the advantages of data warehousing:-

Increase customer profitability

Cost effective decision making

Manage customer and business partner relationships

Manage risk, assets and liabilities

Integrate inventory, operations and manufacturing

Improving turnaround time for data access and reporting to locate, access, and analyze information ,for example –To link multiple locations and geographies

Identify developing trends and reduce time to market

Strategic advantage over competitors

Data Warehouse lifecycle

Below is the Data Warehouse project lifecycle:-

• Project Planning

• Business Requirement Definition(SRS)

• Technical Architecture Design

• Dimensional Modeling

• Physical Design (Data Base designing)

• Data Staging Design and Development (ETL)

• Analytic Application Specification Design and Development (OLAP)

• Testing and Production Deployment

• Maintenance and Support

Components of datawarehouse?

Below are the components of a data warehouse:-

Source databases/systems (Example - Legacy, relational, text or external sources)

Data extraction/transformation/load (ETL) tool (Example –Informatica, Datastage)

Data warehouse maintenance and administration tools

Data modeling tool or interface to external data models (Example –Erwin)

Warehouse databases (Example-Teradata, Oracle, and MySql)

End-user data access and analysis/reporting tools (Example –OBIEE, Business Objects)

What is datawarehousing ?

Data Warehousing is a process to properly assemble and manage data from various sources to answer business questions not previously possible or known. A Data Warehouse has a centrally located logical architecture which minimizes data synchronization hence provides a single view of the business. A Data Warehousing is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process.

Teradata Material

Pages

Thursday, 26 June 2014