Thursday, 26 June 2014

Operational DataStore(ODS)


An ODS stands for -Operational DataStore.
ODS is a database which is designed to integrate data from multiple sources for additional operations on the data like - cleaning, resolving redundancy and checking against business rules for integrity etc.
Operational data store is a real time integrated data store in the organization to manage the business process and their results.
Data Warehouse is used for strategic decision making whereas ODS is used to day to day decision making.

Fact tables

Fact -A fact table typically has two types of columns, foreign keys to dimension tables and measures those that contain numeric facts. A fact table can contain fact’s data on detail or aggregated level.
Types of Facts -
Additive:
Additive facts are facts that can be summed up through all of the dimensions in the fact table. A sales fact is a good example for additive fact.
Semi-Additive:
Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Eg: Daily balances fact can be summed up through the customers dimension but not through the time dimension.
Non-Additive:
Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Eg: Facts which have percentages, ratios calculated.

Factless Fact Table:
 In the real world, it is possible to have a fact table that contains no measures or facts. These tables are called “Factless Fact tables”.
Eg: A fact table which has only product key and date key is a factless fact. There are no measures in this table. But still you can get the number products sold over a period of time.
Based on the above classifications, fact tables are categorized into two:
Cumulative:
This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact tables are mostly additive facts. The first example presented here is a cumulative fact table.
Snapshot:
This type of fact table describes the state of things in a particular instance of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

Dimension tables in datawarehouse

Dimension -
A dimension table typically has two types of columns, primary keys to fact tables and textual\descreptive data.
Types of Dimensions -
Slowly Changing Dimensions:
 Attributes of a dimension that would undergo changes over time. It depends on the business requirement whether particular attribute history of changes should be preserved in the data warehouse. This is called a Slowly Changing Attribute and a dimension containing such an attribute is called a Slowly Changing Dimension.

Rapidly Changing Dimensions:
A dimension attribute that changes frequently is a Rapidly Changing Attribute. If you don’t need to track the changes, the Rapidly Changing Attribute is no problem, but if you do need to track the changes, using a standard Slowly Changing Dimension technique can result in a huge inflation of the size of the dimension. One solution is to move the attribute to its own dimension, with a separate foreign key in the fact table. This new dimension is called a Rapidly Changing Dimension.

Junk Dimensions:
A junk dimension is a single table with a combination of different and unrelated attributes to avoid having a large number of foreign keys in the fact table. Junk dimensions are often created to manage the foreign keys created by Rapidly Changing Dimensions.

Inferred Dimensions:
While loading fact records, a dimension record may not yet be ready. One solution is to generate an surrogate key with Null for all the other attributes. This should technically be called an inferred member, but is often called an inferred dimension.

Conformed Dimensions:
A Dimension that is used in multiple locations is called a conformed dimension. A conformed dimension may be used with multiple fact tables in a single database, or across multiple data marts or data warehouses.

Degenerate Dimensions:
 A degenerate dimension is when the dimension attribute is stored as part of fact table, and not in a separate dimension table. These are essentially dimension keys for which there are no other attributes. In a data warehouse, these are often used as the result of a drill through query to analyze the source of an aggregated number in a report. You can use these values to trace back to transactions in the OLTP system.

Role Playing Dimensions:
A role-playing dimension is one where the same dimension key — along with its associated attributes — can be joined to more than one foreign key in the fact table. For example, a fact table may include foreign keys for both Ship Date and Delivery Date. But the same date dimension attributes apply to each foreign key, so you can join the same dimension table to both foreign keys. Here the date dimension is taking multiple roles to map ship date as well as delivery date, and hence the name of Role Playing dimension.

Shrunken Dimensions:
A shrunken dimension is a subset of another dimension. For example, the Orders fact table may include a foreign key for Product, but the Target fact table may include a foreign key only for ProductCategory, which is in the Product table, but much less granular. Creating a smaller dimension table, with ProductCategory as its primary key, is one way of dealing with this situation of heterogeneous grain. If the Product dimension is snowflaked, there is probably already a separate table for ProductCategory, which can serve as the Shrunken Dimension.

Static Dimensions:
Static dimensions are not extracted from the original data source, but are created within the context of the data warehouse. A static dimension can be loaded manually — for example with Status codes — or it can be generated by a procedure, such as a Date or Time dimension.

What is a staging area? Why we need it? What is the purpose of a staging area?


Staging area is place where we the hold temporary tables on data warehouse server.
Staging tables are connected to work area or fact tables.
We basically need staging area to hold the data for cleansing(removing spaces/nulls or to provide expected date format etc.)
and merging purpose, before loading the data into data warehouse/data mart.

Types of Schemas in dimensional modeling:


 Below are the types of schema in dimensional modeling with their advantages and disadvantages:-

Star schema :-
A logical structure that has a fact table containing factual data in the center, surrounded by dimension tables containing reference data, which can be denormalized

Advantage
Improve query performance for often-used data
Easy to define
Reduces number of physical joins

Disadvantage
High overhead of redundant data
Inappropriate when additional data is not accessed often.

Snowflake schema
A variant of the star schema where dimension tables do not contain denormalized data
It allows dimensions to have dimensions

Advantage
Provides best performance when queries involve aggregation

Disadvantage
Maintenance is complicated
Number of tables involve are more
More joins are needed

Starflake schema
A hybrid structure that contains a mixture of star and snowflake schema

What is Datamart ? why we need it? Types of Datamarts


Definition of datamart:–

Datamart is a subset of a Data Warehouse that supports the requirements of a particular department or business function.
Data Mart is specialized single-line of business warehouses. Data Mart collects data for a single department or a specific group of people.


Below are the reasons to create Data Marts :-

1. Provide users access to data the need to analyze most often
2. Provide data in a common format for a group of users
3. Improve end user response time
4. Provide appropriately structured data based on the requirements of end user access
    tools
5.Less complex implementation and setup
6.Reduced cost
7.Easier to support users


Types of Datamarts


There are three types of Datamarts mentioned below :-

1. Multidimensional Datamarts
2. Relational Datamarts
3. Networked Datamarts


Multidimensional Datamarts:-

Contain only numeric data
Maintain structure as data enters the framework
Only changed cells are updated and not the others
Contain sparsely populated matrices

Relational Data Marts :-

Contains numeric and textual both the data
Support star schemas/snow flake schema
Maintain structured data
Employ indexes while creating datamarts

Networked Data Marts:-

Data needs to be cleansed and standardized
Distributed data marts can be accessed from a single workstation
A standard metadata directory must exist in it
 

Data Mining


Data mining is a process of extracting hidden trends within a datawarehouse.
Example –An insurance datawarehouse can be used to mine data to identify the most high risk buildings to insure in a certain geographical area .
Data Mining also involves analyzing moderate to large amounts of detailed historical data to detect behavioral patterns that are then used to predict future behavior.
Using Data Mining, organization can able to find the answer of the questions like-Which customers are most likely to leave?

Advantages of Data Ware Housing


Below are the advantages of data warehousing:-
Increase customer profitability
Increase customer profitability
Cost effective decision making
Manage customer and business partner relationships
Manage risk, assets and liabilities
Integrate inventory, operations and manufacturing
Improving turnaround time for data access and reporting to locate, access, and analyze information ,for example –To link multiple locations and geographies
Identify developing trends and reduce time to market
Strategic advantage over competitors

Data Warehouse lifecycle


Below is the Data Warehouse project lifecycle:-
• Project Planning
• Business Requirement Definition(SRS)
• Technical Architecture Design
• Dimensional Modeling
• Physical Design (Data Base designing)
• Data Staging Design and Development (ETL)
• Analytic Application Specification Design and Development (OLAP)
• Testing and Production Deployment
• Maintenance and Support

Components of datawarehouse?

Below are the components of a data warehouse:-

Source databases/systems (Example - Legacy, relational, text or external sources)
Data extraction/transformation/load (ETL) tool (Example –Informatica, Datastage)
Data warehouse maintenance and administration tools
Data modeling tool or interface to external data models (Example –Erwin)
Warehouse databases (Example-Teradata, Oracle, and MySql)
End-user data access and analysis/reporting tools (Example –OBIEE, Business Objects)

What is datawarehousing ?

Data Warehousing is a process to properly assemble and manage data from various sources to answer business questions not previously possible or known. A Data Warehouse has a centrally located logical architecture which minimizes data synchronization hence provides a single view of the business. A Data Warehousing is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process.

Global Temporary tables in teradata

About GLOBAL TEMPORARY: 

You are working in a Teradata database application. 
You have to CREATE a table and insert some records for performance analysis. 
The table has several columns and you have to DELETE the content once you have done with your job. 
But you don't want to drop the table since it is being used daily for the analysis. 

What can be done for this? 

Teradata gives GLOBAL TEMPORARY option with CREATE TABLE syntax. 

What special ? 

If you create a table with GLOBAL TEMPORARY option, the life of the CONTENTS of the table will be only for the current session. 
They will be automatically deleted by Teradata manager once session expires. (Only the content not definition) 

Syntax: 
  1. CREATE GLOBAL TEMPORARY TABLE table_1
  2. (
  3. column1 datatype,
  4. column2 datatype,
  5. .
  6. .
  7. columnN datatype
  8. );



  1. CREATE GLOBAL TEMPORARY TABLE employee
  2. (
  3. ID INT,
  4. Name VARCHAR(20)
  5. );


Interesting about GLOBAL TEMPORARY: 

Now, you are creating a GLOBAL TEMPORARY table and inserting some records successfully into it. 
SELECT * from the GLOBAL TEMPORARY table. Shocked ? ... 
Yes. You cannot see the records because Teradata becomes strict to GLOBAL TEMPORARY tables by incorporating ON COMMIT DELETE ROWS by default to add more meaning to TEMPORARY. 

so, if you want to see the records, you have to explicitly write ON COMMIT PRESERVE ROWS immediately after CREATE TABLE syntax. 

  1. CREATE GLOBAL TEMPORARY TABLE table_1
  2. (
  3. column1 datatype,
  4. column2 datatype,
  5. .
  6. .
  7. columnN datatype
  8. ) ON COMMIT PRESERVE ROWS;



More about GLOBAL TEMPORARY: 

1. You can have 2000 volatile tables in an active session. 
2. REFERENTIAL constraints are not allowed. 



Here GLOBAL TEMPORARY says, DO and GO.. i will PRESERVE and take care... 

..

Volatile table in teradata

You are working in a Teradata database application. You have to CREATE a table and insert some records for performance analysis. You have to DROP the table once you have done with your job. If this is the case, will you think a smart way for this?

Yeah. Teradata gives VOLATILE option with CREATE TABLE syntax. 

What special ?

If you create table with VOLATILE option, the life of the table will be only for the current session.
It will be automatically dropped by Teradata manager once session expires. (Both definition and content)

Syntax:
  1. /* Volatile Table Syntax - Teradata */
  2. CREATE VOLATILE TABLE table_1
  3. (
  4. column1 datatype,
  5. column2 datatype,
  6. .
  7. .
  8. columnN datatype
  9. );



  1. CREATE VOLATILE TABLE employee
  2. (
  3. ID INT,
  4. Name VARCHAR(20)
  5. );


Interesting about VOLATILE:

Now, you are creating a VOLATILE table and inserting some records successfully into it.
SELECT * from the volatile table. Shocked ? ...
Yes. You cannot see the records because Teradata becomes strict to VOLATILE tables by incorporating ON COMMIT DELETE ROWS by default to add more meaning to VOLATILE.

so, if you want to see the records you have to explicitly write ON COMMIT PRESERVE ROWS immediately after CREATE TABLE syntax.

  1. CREATE VOLATILE TABLE table_1
  2. (
  3. column1 datatype,
  4. column2 datatype,
  5. .
  6. .
  7. columnN datatype
  8. ) ON COMMIT PRESERVE ROWS;



More about VOLATILE:

1. You can have 1000 volatile tables in an active session.
2. CHECK and REFERENTIAL constraints are not allowed.
3. DEFAULT clause not allowed.


Here VOLATILE says, DO and GO.. i will take care...

COALESCE

COALESCE is used to check if the argument is NULL, if it is NULL then it takes the default value.

It will check for NOT NULL values sequentially in the list and it will return the first NOT NULL value.

Syntax:
  1. COALESCE(Argument list,['default value'])

Argument list - May be columns of a table or expressions

COALESCE returns NULL if all its arguments evaluate to null. 
Otherwise, it returns the value of the first non-null argument in the given expression list.
Each COALESCE function must have at least two operands. Either two columns or one column with one default value.

It is possible to have multiple columns under coalesce like below:
  1. COALESCE(col1, col2, col3, 0)

The above code says that if col1 is null then it will check col2.
If col2 is null then it will check for col3.
If col3 is also null, then 0 will be the output.

If coalesce encounters any of the column having not null value then the column value will be returned by the query. (Left to right priority)

Example : 1
If the country takes NULL value, 'US' will be supplied by Teradata
  1. SELECT COALESCE(Country,'US') as Output

Output
CountryOutput
ItalyItaly
NULLUS

Example: 2
The following example returns the home phone number of the named individual (if present), or office phone if HomePhone is null, or CellPhone if present and both home and office phone values are null. Returns NULL if all three values are null.

  1. SELECT
  2. Name,
  3. COALESCE (HomePhone, OfficePhone, cellPhone) FROM PhoneDirectory;

When a default value is provided, default value will be returned if all the input fields are null
  1. SELECT
  2. Name,
  3. COALESCE (HomePhone, OfficePhone, cellPhone,'No Phone') FROM PhoneDirectory;


Example:3
Whenever you combine COALESCE with non-comparison operators, such as || or * or +, you must enclose the entire COALESCE function and its operands in parentheses.

  1. SELECT (COALESCE(100,0)) * 5;