Saturday 22 February 2014

All AMP operation vs Full Table Scan

All AMP Operations vs Full Table Scan
All AMP Operation (NUSI) :
1. The SQL is submitted, specifying a NUSI.The hashing algorithm calculates a row hash value for
the NUSI .
2. All AMPs are activated to find the hash value of the NUSI in their index subtables. The AMPs
whose subtables contain that value become the participating AMPs in this request The other
AMPs discard the message.
3. Each participating AMP locates the row IDs (row hash value plus uniqueness value) of the base
rows corresponding to the hash value.
4. The participating AMPs access the base table rows, which are located on the same AMP as the
NUSI subtable.
5. The qualifying rows are sent over the BYNET to the PE, and the PE sends the answer set on to
the client application .


Full-Table Scan : (Accessing data with out indexes) :
If the request does not use a defined index, the Teradata Database does a full-table scan. A full-table scan is another way to access data without using Primary or Secondary Indexes. In evaluating an SQL request, the Optimizer examines all possible access methods and chooses the one it believes to be the most efficient.

A request could turn into a full-table scan when:
The SQL request searches on a NUSI column with many duplicates.
An SQL request uses a non-equality WHERE clause on an index column.
An SQL request uses a range WHERE clause on an index column.

For all requests, you must specify a value for each column in the index

Data File Formats in bteq and load utilities
Bteq :

While creating the data file (exporting from table) the following modes are accepted.
DATA :
 Record Mode (Raw Binary Data).
INDICDATA :
 contains indicator bytes.number of indicator bytes = 1 bit for each column.that bit will be
set to 1 if the value for a perticular column is NULL else it will set to 0.
REPORT :
 Field Mode (column Headings).data is exported 1 column at a time. default format.
DIF :
 Data Interchange Format . to transport data to various pc programs.

while importing target table from data file the following modes are accepted.
DATA :
INDICDATA :
VARTEXT :
 each column is separated by a delimiter.

FastLoad,MultiLoad,FastExport,TPUMP,TPT
there are 5 data formats available for load or unload utilities.
Formatted
 :
Each record is in a format traditionally known as FastLoad or Teradata format. The data is prefixed with the data length and followed by and end-of-record (EOR) marker. This is the most stringent and robust data format and will allow any TPT supported data type to be included.
Let’s consider a record containing 6 bytes of inferred text data "abcdef";

x'00 06 61 62 63 64 65 66 0a' and in character: "..abcdef."

Text :
There is no length indicator as with formatted data (above). Records are separated only by an end-of-record (EOR) marker. To insure data integrity, each record must be entirely character data.
The EOR marker can be either a single-byte linefeed (X'0A') or a double-byte carriage-return/line-feed pair (X'0D0A'). Whether a single or double byte EOR is used is determined by the EOR marker encountered for the first record. If the first record has a single byte EOR and contains the text "abcd", it would look like this:
x'61 63 64 63 0a' and in character: "abcd."

Every subsequent record is expected to use the same single byte EOR.
If the first record has a double byte EOR and contains the text "acdc", it would look like this:
x'61 63 64 63 0d 0a' and in character: "abcd.."

Again, every subsequent record is expected to use the same double-byte EOR.
Single and double byte EORs cannot be mixed in the same data.

Binary :
This format is the same as formatted (see above) except that there is no trailing EOR byte. Each record contains a 2-byte integer data length, n, followed by n bytes of data. As with formatted data, all TPT data types are supported.
E.g., a record containing 4 bytes of text "abcd" would be;
x'00 06 61 62 63 64' and in character: "..abcd"

Delimited / variable length Text :
This format is an extension of the text format. It is similar because it must contain all character data and be terminated by an EOR marker. But it has a different restriction: all columns in the TPT schema must be VARCHAR** (as opposed to CHAR/ANSIDATE for simple text). The advantage of this format is that it supports an additional feature that allows the definition of individual fields (which map to DBS table row columns) separated by the special delimiter character. To successfully parse delimited data, the delimiter character can not appear within the data. Delimiter data embedded within a column will result in a data column count error.
UnFormatted :
The data does not conform to any external format. There is neither a prepended length as with binary and formatted nor is there an EOR marker as with text and delimited.
Collect Statistics In Teradata
Collect stats is an important concept in teradata.
collect stats gives PE to come up with a plan with least cost for an requested query.
 
Collect stats defines the confidence level of PE in estimating "how many rows it is going to access ? how many unique values does a table have , null values etc and all this info is stored in data dictionary. Once you submit a query in TD the parsing engine checks if the stats are avaiable for the requested table , if it has collected stats earlier PE generates a plan with "high confidence" .
in absence of collect stats plan will PE with "low confidence" . however teradata's optimizer is very robust and intelligent, even if you do not collect stats on a table,column,indexes PE does an "Dynamic Amp sampling " which means it will select a random amp and this random amp comes up with the info about the table data it has , based upon this PE ( knows datademographics & available system componets) estimates the workload and generates a plan.
how to find the list of views that are created on a table in teradata
the following query gives the list of views that are created on a table.

select tablename from dbc.tables where TableKind='V' and RequestText LIKE '%Table%' group by 1 order by 1;

In place of Table specify your table name.
for example if i want to find the list of views that are created on table test1,i will use the following query.

select tablename from dbc.tables where TableKind='V' and RequestText LIKE '%test1%' group by 1 order by 1;

The query displays

TableName
------------------------------
view1
view2
view3
How to remove duplicate rows from a teradata table
This method uses a temporary table to remove the duplicate rows.
if a table contains duplicate rows means
It should be
1. MultiSet tabel.
2. should not contain UPI.

let us assume we need to remove duplicate rows from tab1.
the table is created as
create multiset table tab1(col1 integer,col2 integer);
populate the table with duplicate rows.

STEP1:
create a temporary SET/MULTISET table () with the following query.
create set table temp_tab1 as tab1 with no data;
this will create a temporary set table which does not allow duplicates.

STEP2:
populate the temporary temporary set table by removing the duplicate rows from the original table. this can be done using the following query.

insert into temp_tab1 select * from tab1 group by col1,col2 having count(*) > 1;
(or)
insert into temp_tab1 select distinct * from tab1 ;


STEP3:
delete all the duplicate rows from the original table.(this query rows all the duplicate rows including the base row).
 delete from tab1
   where (col1,col2)
   in (select col1,col2 from temp_tab1);

STEP4:
insert the rows from the temp table to the base table using the following query.
insert into tab1
   sel * from temp_tab1;

there are two more methods to delete duplicate rows without using the temporary
table  using the Teradata Fastload Utility and Teradata Warehouse Builder.

Method1 :
Use Fast Export to write the rows from the source table into a file.
Then Delete all rows from the source table.
then use FastLoad to load the target table.The file to load the table is generated from the FastExport.

Method 2:
Use Teradata WareHouseBuilder (TWB) with 2 steps.
in the first step use data connector as consumer and Export operator as producer.
in the second step use Dataconnector as producer and load operator as consumer.
these are based on the fact that Fastload cannot load duplicate rows.
concept of indexes in teradata
UPI, NUPI:
Teradata uses primary index to distribute data across AMPs. 
PI access is always one amp operation as teradata knows exactly where the record is. 
So both UPI, NUPI results in one amp operation. 
Where clause on PI with UPI may return 0 to 1 record
Where clause on PI with NUPI may return 0 to many records.

USI, NUSI:
Secondary index provides an alternate path to access data. 
Index creates sub table on all AMPs. 
It stores index value, index rowhash, rowid of record. So when ever you insert data into a table having index, DB also makes corresponding entries into index sub tables.
USI is 2 AMP operation. First teradata hashes index value, goes to a particular AMP, looks for index rowhash, Then gets the index value and also rowid where record is present. Its definitely much faster than doing Full table scans.
NUSI is all AMP operation but not full table scan. In case of NUSI, each index sub table stores only its values. So if you limit on a NUSI column, Each AMP will search for the limit you apply and returns records if present.
Global Temporary Tables and Volatile Temporary Tables in Teradata
These are Teradata Specific concepts.
Both are temporary tables and are used normally as a kind of "intermediate stage" for an SQL operation.

Global Temporary Tables (GTT) :
1.GTT takes up space from Temporary space
2.GTT survive TD system restarts
3.Index can be created on GTT.
4.CHECK or BETWEEN constraints, COMPRESS column and DEFAULT
and TITLE clause are supported by GTT.
5.you can collect stats on GTT.
6.Table Definition is stored in DD for GTT.
7.In a single session 2000 Global temporary table can be materialized.
8.Materialized instance of GTT will be discarded at session end.

Volatile Temporary Tables (VTT):
1.VT takes up space from spool, doesn't have any info in DD.
so is slightly faster that GTT in that terms because no DD lookup is required.
2.VT can't survive TD system restarts
3.Index cann't be created on GTT.
4.CHECK or BETWEEN constraints, COMPRESS column and DEFAULT and TITLE clause are not supported by VT.
5.No collect stats for VTT.
6.Table Definition is stored in System cache for VTT.
7.In a single session
 1000 Volatile tables can be materialized.
8.Volatile tables are always qualified by the session's userid.
How do i update multiple rows of a table with different values in a single query??
select * from tab1;

*** Query completed. 2 rows found. 2 columns returned.
*** Total elapsed time was 1 second.

col1 col2
----------- ----------
101 vij
102 ven

UPDATE tab1
SET col2 = CASE WHEN col1 = 101 THEN 'XXXX'
WHEN col1 = 102 THEN 'YYYY'
END;


select * from tab1;
*** Query completed. 2 rows found. 2 columns returned.
*** Total elapsed time was 1 second.

col1 col2
----------- ----------
101 XXXX
102 YYYY
Concept of Primary Key and Primary Index in Teradata
In Teradata,
Primary key is used to uniquely identify each row in a table.PKs are used in conjunction with foreign keys to define the important column relationships in a database.

The concept of primary index is completely different from that of Primary key.
Primary Index is used to distribute and locate rows of a table over the AMPS .Choice of Primary Index will affect distribution ,access and performance.

Oftentimes, but not always, the Primary Index and Primary Key are the same.
Indexes (primary or secondary) may be used to enforce uniqueness (as in a PK) or to improve access.

Accessing rows using primary index is always one AMP operation.

PK is a relational modeling convention where as PI is a teradata convention.


Example to create a table with Unique Primary Index (UPI) :
CREATE TABLE MJ1.emp_data,FALLBACK,
NO BEFORE JOURNAL,
NO AFTER JOURNAL
(
employee_number INTEGER NOT NULL,
department_number SMALLINT,
job_code INTEGER COMPRESS ,
last_name CHAR(20) NOT CASESPECIFIC NOT NULL,
first_name VARCHAR(20) NOT CASESPECIFIC,
street_address VARCHAR(30) NOT CASESPECIFIC TITLE 'Address',
city CHAR(15) NOT CASESPECIFIC DEFAULT 'Boise'
COMPRESS 'Boise',
state CHAR(2) NOT CASESPECIFIC DEFAULT ' ',
birthdate DATE FORMAT 'mm/dd/yyyy',
salary_amount DECIMAL(10,2),
sex CHAR(1) UPPERCASE NOT CASESPECIFIC)
UNIQUE PRIMARY INDEX ( employee_number )
INDEX ( department_number );

2 comments:

  1. Hi Experts,

    Need clarifications on Non Unique Primary Index. Example : I have created a EMP table with DEPT_NO has NUPI, since one or more employee work same department.Lets consider I am having 4 AMP system

    case : If i have inserted Dept No value as 20 and it stores in Vdisk of AMP 2 say for example, then again i am going to insert DEPT_NO= 20 for different employee. Now this value also stores in AMP 2.then i insert one more record with DEPT_NO=10, now it goes to different AMP 3 for example.

    Case 1:If i do select * from EMP where DEPT_NO=20 - data is BYNET communicates to ONE AMP

    Case 2: select * from EMP where DEPT_NO in ('20','10'); -- it will communicate to TWO - AMP ?

    Please explain how NUPI is defined as ONE AMP operation.

    ReplyDelete
    Replies
    1. When we say UPI or NUPI does single amp operation, it means you access them using '='. In this way you are accessing an index. If you are using 'IN' clause and you have more than one value, it will definitely be retrieved from different amp . But again , one value of UPI/NUPI belongs to one amp only. Hope it helps.

      Delete