Teradata SQL
29 Aug 2013
17 Jul 2013
--> Journals in Teradata?
1. Transient Journal - Maintains the current transaction's before-images. Once the transaction completes successfully, its entries are deleted from the journal; if the transaction fails, the data is rolled back from this table.
2. Permanent Journal - Defined when a table is created or altered (a DDL sketch follows this list). It can store BEFORE or AFTER images of table rows, and either image can also be maintained in duplicate as a DUAL journal. A permanent journal keeps a complete change history and is typically defined for critical tables; a database can have one permanent journal table. It is used mainly for point-in-time recovery after accidental data loss, and it resembles the redo mechanism in Oracle.
3. Down AMP Recovery Journal (DARJ) - Activates automatically when an AMP goes down. It is used with fallback-protected tables to record any changes made while the AMP is unavailable. When the failed AMP is brought back online, the restart process applies the changes held in the DARJ to the recovered AMP.
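Below is a minimal sketch of how a permanent journal might be defined. The database name Sales_DB, the table name Orders and the space figure are assumptions for illustration; the journal table has to be declared at the database level before a table can reference it.
-- Declare a default journal table for the database (hypothetical names).
CREATE DATABASE Sales_DB AS
PERM = 10E9,
DEFAULT JOURNAL TABLE = Sales_DB.Sales_Jrnl;
-- Create a table that keeps dual before-images and single after-images
-- in that journal; FALLBACK is shown because the DARJ relies on it.
CREATE TABLE Sales_DB.Orders, FALLBACK,
DUAL BEFORE JOURNAL,
AFTER JOURNAL,
WITH JOURNAL TABLE = Sales_DB.Sales_Jrnl
(
order_id INTEGER NOT NULL,
customer_id INTEGER NOT NULL,
order_date DATE,
order_total DECIMAL(12,2)
)
PRIMARY INDEX (order_id);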
--> Partitioned Primary Index (PPI)?
Partitioned Primary Index is a distinctive Teradata feature that lets queries access only a portion of a large table. This avoids the overhead of scanning the complete table and thus improves performance.
A PPI table still hashes rows to AMPs exactly as a table with a normal PI does. PPI does not alter data distribution; it only partitions the rows that are already distributed based on the PI.
PPIs are usually defined on a table to increase query efficiency by avoiding full-table scans, without the overhead and maintenance costs of secondary indexes.
Partitions are usually defined based on a range or a case expression, as follows.
Two functions, RANGE_N and CASE_N, can be used to simplify the specification of a partitioning expression.
1. Partition by CASE
CREATE TABLE ORDER_Table
(
ORD_number INTEGER NOT NULL,
customer_number INTEGER NOT NULL,
order_date DATE,
order_total INTEGER
)
PRIMARY INDEX (customer_number)
PARTITION BY CASE_N (
order_total < 10000,
order_total < 20000,
order_total < 30000,
NO CASE OR UNKNOWN);
2. Partition by Range - example using date range
CREATE TABLE ORDER_Table
(
ORD_number INTEGER NOT NULL,
customer_number INTEGER NOT NULL,
order_date DATE,
order_total INTEGER
)
PRIMARY INDEX (customer_number)
PARTITION BY RANGE_N (
order_date BETWEEN DATE '2010-01-01' AND DATE '2010-12-31'
EACH INTERVAL '1' MONTH,
NO RANGE OR UNKNOWN);
Note: If we use NO RANGE or NO CASE, all values that do not fall into any defined range or case go into a single catch-all partition.
If we specify UNKNOWN, rows whose partitioning expression evaluates to NULL are placed in that partition as well (written separately, UNKNOWN gets its own partition).
PPI improves performance in the following ways (an example query is sketched after this list):
Automatic optimization occurs for queries that specify a restrictive condition on the partitioning column.
Partition elimination improves the efficiency of range searches when, for example, the table is range-partitioned on the search column.
Only the rows in the partitions qualified by the query need to be accessed, avoiding full-table scans.
It provides an access path to the rows in the base table while still allowing efficient join strategies.
If the same partition is consistently targeted, the part of the table being updated may fit largely in cache, significantly boosting performance.
PPI tables also have an advantage during purging, since deleting data partition by partition is very fast.
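Here is a minimal sketch of a query that benefits from partition elimination on the ORDER_Table defined above; the date range is only illustrative.
SELECT ORD_number, order_total
FROM ORDER_Table
WHERE order_date BETWEEN DATE '2010-03-01' AND DATE '2010-03-31';
-- The EXPLAIN for this request should show access confined to a single
-- partition of ORDER_Table rather than an all-rows scan.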
PPI also comes with some disadvantages:
The PI access disadvantage occurs only when the partitioning column is not part of the PI. In this situation, a query that specifies a PI value but no value for the partitioning column must probe each partition for that value, instead of positioning directly to the first row for the PI value.
Another disadvantage appears when a non-PPI table is joined to a PPI table on a PI = PI condition. Because only one of the tables is partitioned, the rows are not ordered the same way, and the join, in effect, becomes a set of sub-joins, one for each partition of the PPI table. This type of join is called a sliding-window join.
Limitations of Partitioned Primary Index (PPI):
The primary index of a PPI table must be non-unique if the partitioning column is not part of the index, since enforcing a unique PI would require checking for a duplicate key value in every partition, which would be very expensive.
The primary index of a PPI table can be unique if the partitioning column is part of the UPI; uniqueness can then be checked within a single partition.
PPI cannot be defined on global temporary tables, volatile tables, or compressed join indexes.
PPI table rows occupy two extra bytes compared to NPPI table rows, as these extra bytes store the partition number for each row.
PPI table rows are four bytes wider if value compression is specified for the table.
Since a PPI can result in a large number of partitions, there is some ongoing overhead for the user/DBA, who has to regularly collect statistics on the partitioning column.
It is beneficial to collect statistics on the system-derived PARTITION column. Collecting stats on PARTITION is fast because, rather than reading all the base-table rows, Teradata usually just scans the cylinder index for the table. Keeping these statistics current helps the optimizer skip unnecessary partitions, speeding up access.
HELP STATS tablename COLUMN PARTITION;        -- lists the partitions in the table and their details
COLLECT STATS ON tablename COLUMN PARTITION;  -- refreshes the partition statistics
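As a quick sanity check, the system-derived PARTITION column can also be queried directly. A minimal sketch against the ORDER_Table example above:
SELECT PARTITION, COUNT(*)
FROM ORDER_Table
GROUP BY 1
ORDER BY 1;
-- Returns the populated partition numbers and the row count in each,
-- which is handy for spotting skewed or empty partitions.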
--> Getting Previous Row Data in Teradata Using OLAP Functions?
"ROWS UNBOUNDED PRECEDING" is used in an ordered analytical function to tell it to include all of the preceding rows in the partition in the calculation being performed.
“ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING” is used in an ordered analytical function to tell it to include only preceding row in the partition in the calculation being performed.
“ROWS BETWEEN 1 FOLLOWINGAND 1 FOLLOWING” is used in an ordered analytical function to tell it to include only following row in the partition in the calculation being performed.
Then Following example illustrates OLAP Functions Usage:
CREATE MULTISET TABLE TEMP1 (COL1 INT, COL2 INT);
INSERT INTO TEMP1 VALUES(100 , 2000);
INSERT INTO TEMP1 VALUES(100 , 3000);
INSERT INTO TEMP1 VALUES(100 , 4000);
INSERT INTO TEMP1 VALUES(300 , 5000);
INSERT INTO TEMP1 VALUES(300 , 6000);
INSERT INTO TEMP1 VALUES(300 , 7000);
SELECT COL1
,COL2
,COALESCE(MIN(col2) OVER (PARTITION BY col1 ORDER BY col2 ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), 0) A1
,COALESCE(MAX(col2) OVER (PARTITION BY col1 ORDER BY col2 ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), 0) A2
,COALESCE(MIN(col2) OVER (PARTITION BY col1 ORDER BY col2 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), 0) A3
,COALESCE(MAX(col2) OVER (PARTITION BY col1 ORDER BY col2 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), 0) A4
,COALESCE(SUM(col2) OVER (PARTITION BY col1 ORDER BY col2 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), 0) A5
,COALESCE(AVG(col2) OVER (PARTITION BY col1 ORDER BY col2 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), 0) A6
,COALESCE(COUNT(col2) OVER (PARTITION BY col1 ORDER BY col2 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), 0) A7
FROM TEMP1;
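For reference, the query above should return something close to the result below (assuming only the six inserts shown; row order is not guaranteed because the outer SELECT has no ORDER BY). A1 is the previous row's COL2 within each COL1 group and A2 is the next row's COL2, with COALESCE turning the missing edge rows into 0.
COL1  COL2  A1    A2    A3    A4    A5     A6    A7
100   2000  0     3000  2000  4000  9000   3000  3
100   3000  2000  4000  2000  4000  9000   3000  3
100   4000  3000  0     2000  4000  9000   3000  3
300   5000  0     6000  5000  7000  18000  6000  3
300   6000  5000  7000  5000  7000  18000  6000  3
300   7000  6000  0     5000  7000  18000  6000  3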
"ROWS UNBOUNDED PRECEDING" is used in an ordered analytical function to tell it to include all of the preceding rows in the partition in the calculation being performed.
“ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING” is used in an ordered analytical function to tell it to include only preceding row in the partition in the calculation being performed.
“ROWS BETWEEN 1 FOLLOWINGAND 1 FOLLOWING” is used in an ordered analytical function to tell it to include only following row in the partition in the calculation being performed.
Then Following example illustrates OLAP Functions Usage:
CREATE MULTISET TABLE TEMP1 (COL1 INT, COL2 INT);
INSERT INTO TEMP1 VALUES(100 , 2000);
INSERT INTO TEMP1 VALUES(100 , 3000);
INSERT INTO TEMP1 VALUES(100 , 4000);
INSERT INTO TEMP1 VALUES(300 , 5000);
INSERT INTO TEMP1 VALUES(300 , 6000);
INSERT INTO TEMP1 VALUES(300 , 7000);
SELECT COL1
,COL2
,COALESCE(MIN(col2) OVER (PARTITION BY col1 ORDER BY col2 ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING), 0) A1
,COALESCE(MAX(col2) OVER (PARTITION BY col1 ORDER BY col2 ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING), 0) A2
,COALESCE(MIN(col2) OVER (PARTITION BY col1 ORDER BY col2 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), 0) A3
,COALESCE(MAX(col2) OVER (PARTITION BY col1 ORDER BY col2 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), 0) A4
,COALESCE(SUM(col2) OVER (PARTITION BY col1 ORDER BY col2 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), 0) A5
,COALESCE(AVG(col2) OVER (PARTITION BY col1 ORDER BY col2 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), 0) A6
,COALESCE(COUNT(col2) OVER (PARTITION BY col1 ORDER BY col2 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING), 0) A7
FROM TEMP1;
--> Few Tricky Questions on EXPLAIN Plans?
These are some tricky questions about EXPLAIN output, asked to see whether a candidate has really worked on query tuning and how well they understand the optimizer and EXPLAIN (a small usage sketch follows the answers below).
1. What do "pseudo table" locks mean?
A pseudo-table lock is a placeholder lock applied on a table to prevent two users from acquiring conflicting table-level locks through all-AMP requests.
The PE designates a particular AMP to manage all table-level lock requests for a given table, and that AMP places the pseudo-table lock first.
To put it in simple terms, it is like a university with 10 gates where, at any given time, you can enter only through one designated gate (the main gate) for the security check.
2. What is a residual condition?
A residual condition is a filter applied to a particular table while its rows are read, to limit the number of rows fetched into spool.
For example:
SEL * FROM EMP.Employee WHERE emp_sal > 10000;
Here, the residual condition ensures that only employees with a salary greater than 10000 are fetched.
3. What does spool "Last Use" mean?
The spool file referenced in that step is being used for the last time; its spool space is released afterwards because it is not needed in any further step of the query execution.
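To see these phrases in context, simply prefix a query with EXPLAIN; a minimal sketch (the table name is the illustrative one used above):
EXPLAIN
SEL * FROM EMP.Employee WHERE emp_sal > 10000;
-- In the returned plan text, look for the pseudo-table locking step,
-- the residual condition applied while scanning EMP.Employee,
-- and any spool step annotated with "Last Use".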
--> Methods for populating a data warehouse?
Source tables change over time. A data mart or data warehouse that is based on those tables needs to reflect these changes. This blog post will explain different solutions for solving this problem.
When moving data into a data warehouse, taking it from a source system is the first step in the ETL process. Once extracted from the source, the data can be cleaned and transformed so it can be loaded into a staging table or directly into the data warehouse.
The source system for a data warehouse is typically an online transaction processing (OLTP) application, such as an ERP system, payroll application, order entry system, CRM, etc.
Designing and creating the process to extract the data from the source system is usually the most time-consuming task in the ETL process, if not in the entire data warehousing effort. Source systems are usually very complex, with tables and fields that are difficult to understand and poorly documented (many popular ERP systems use numbers for table names). This makes determining which data needs to be extracted a challenge. Usually the data needs to be extracted on a daily basis to supply all changed data to the data warehouse and keep it up to date. Moreover, the source systems usually cannot be modified, nor can their performance or availability be adjusted, to accommodate the needs of the data warehouse extraction process.
The focus of this blog will be the technical considerations of having different kinds of sources and extraction methods. We will discuss the most common techniques used for extracting data from source databases and the ways to determine the data that has changed since the last extraction.
Designing this process means answering the following two questions:
1) Which extraction method do I choose? This influences the source system, the transportation process, and the time needed for refreshing the data warehouse.
2) How do I make the extracted data available for further processing? This influences the transportation method and the need for cleaning and transforming the data.
What are the different extraction methods?
The extraction method is highly dependent on the source system as well as the business needs in the target data warehouse. Usually there is no possibility of adding extra logic to the source system to help with the incremental extraction of data.
Here are the types of extraction methods:
Full Extraction: All the data is extracted completely from the source system. Because this extraction reflects all the data currently available on the source system, there is no need to keep track of changes to the source data since the last successful extraction. The source data will be provided as-is and no additional information (i.e., timestamps) is necessary from the source data. An example for a full extraction may be an export file of a complete table or a SQL SELECT statement that retrieves all the rows from a table. Many times a full extraction will be used for tables that will be used as dimension tables in a cube.
Incremental Extraction: Only the data that has changed since a specific point in time will be extracted. This point in time may be the time of the last extraction, or a business event like the last day of a fiscal period. To identify this delta, it must be possible to identify all the information that has changed since that specific point in time (see "How to determine the data that has changed since the last extraction?" below). Many times an incremental extraction will be used for tables that will become fact tables in a cube. After the data is extracted, you can use a control table to store the max date of the extracted records, and on the next run retrieve all the source rows newer than that max date. Alternatively, you can query the destination table in the data warehouse for its max date and pull everything newer than that, but this can take a long time if the destination table has many records. Instead of the max date, another option is to use the max ID if the source system has a unique, increasing integer key. A sketch of the control-table approach follows.
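This is a minimal sketch of that control-table pattern in generic SQL; the names etl_control, src.Orders and LastModifiedDate are assumptions, and :MaxLoadDate / :NewMaxLoadDate stand for values bound by the ETL tool.
-- 1. Read the high-water mark recorded by the previous run.
SELECT MaxLoadDate
FROM etl_control
WHERE SourceTable = 'Orders';
-- 2. Extract only the rows changed since that point in time.
SELECT *
FROM src.Orders
WHERE LastModifiedDate > :MaxLoadDate;
-- 3. After a successful load, advance the high-water mark to the
--    MAX(LastModifiedDate) of the rows just extracted.
UPDATE etl_control
SET MaxLoadDate = :NewMaxLoadDate
WHERE SourceTable = 'Orders';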
For each of these methods, there are two ways to physically extract the data:
Online Extraction: The data is extracted directly from the source system itself. The extraction process can connect directly to the source system to access the source tables themselves, or to an intermediate system that stores the data in a preconfigured manner (i.e., transaction logs or change tables).
Offline Extraction: Many times direct access to the source system is not available, so instead the data is staged outside the original source system and created by an extraction routine. The data is usually in a flat file that is in a defined, generic format. Additional information about the source object is necessary for further processing.
How to determine the data that has changed since the last extraction?
If a data warehouse extracts data from an operational system on a nightly basis, then the data warehouse requires only the data that has changed since the last extraction (that is, the data that has been modified in the past 24 hours). When it is possible to efficiently identify and extract only the most recently changed data, the extraction process (as well as all downstream operations in the ETL process) can be much more efficient, because it must extract a much smaller volume of data. Unfortunately, for many source systems, identifying the recently modified data may be difficult or intrusive to the operation of the system. Incremental extraction is typically the most challenging technical issue in data extraction. Below are several techniques for implementing incremental extraction from source systems. These techniques are based upon the characteristics of the source systems, or may require modifications to the source systems, so each of them must be carefully evaluated by the owners of the source system prior to implementation. Each of these techniques can work in conjunction with the data extraction techniques discussed previously. For example, timestamps can be used whether the data is being pulled from a flat file or accessed through a query to the source system:
Change Data Capture (CDC): Change Data Capture records INSERTs, UPDATEs, and DELETEs applied to SQL Server tables, and makes a record available of what changed, where, and when, in simple relational ‘change tables’. The source of change data for change data capture is the SQL Server transaction log. As inserts, updates, and deletes are applied to tracked source tables, entries that describe those changes are added to the log. The log then serves as input to the change data capture process. CDC reads the transaction log and adds information about changes to the tracked table’s associated change table. See Improving Incremental Loads with Change Data Capture. Note that CDC is the key-enabling technology for providing near real-time data warehousing. But, it is only available in Enterprise edition of SQL Server 2008 or later. So if you are using an older version of SQL Server or your source is in another technology like Oracle, you can’t use CDC.
Timestamps: The tables in some operational systems have timestamp columns. The timestamp specifies the time and date that a given row was last modified, making it easy to identify the latest data. In SQL Server, many times this column is given a timestamp data type, along with a column name of “Timestamp”. Or, the column is given a datetime data type, and a column name of “Last Modified”.
Partitioning: Some source systems might use range partitioning, such that the source tables are partitioned along a date key, which allows for easy identification of new data. For example, if you are extracting from an orders table, and the orders table is partitioned by week, then it is easy to identify the current week’s data.
Database Triggers: adding a trigger for INSERT, UPDATE, and DELETE on a single table and having those triggers write the information about the record change to ‘change tables’. Note this process is similar to Change Data Capture, so use CDC if you are on SQL Server 2008 or later. Otherwise, triggers are an option.
MERGE Statement: Another option is to extract an entire table from the source system to the data warehouse or staging area, and compare these tables with a previous extract from the source system to identify the changed data (using the MERGE statement in SQL Server). This approach will likely not have a significant impact on the source system, but it can place a considerable burden on the data warehouse, particularly if the data volumes are large.
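Here is a minimal sketch of that compare step using SQL Server's MERGE statement; staging.Orders, dw.Orders and the column names are assumed for illustration.
MERGE dw.Orders AS tgt
USING staging.Orders AS src
    ON tgt.OrderID = src.OrderID
WHEN MATCHED AND (tgt.OrderDate <> src.OrderDate
               OR tgt.OrderTotal <> src.OrderTotal) THEN
    UPDATE SET tgt.OrderDate  = src.OrderDate,
               tgt.OrderTotal = src.OrderTotal
WHEN NOT MATCHED BY TARGET THEN
    INSERT (OrderID, OrderDate, OrderTotal)
    VALUES (src.OrderID, src.OrderDate, src.OrderTotal)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;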
One final idea: to prevent the ETL process that reads the source system from bogging down a production server, you should think about using replication or database snapshots in SQL Server (or the equivalent technologies in other DBMSs). This is especially useful if your ETL process runs multiple times a day while users are on the system.
Incremental Data Loading Pattern
Staging and a left outer join – Takes much longer to implement and potentially to run. This can be done through T-SQL or through a similar SSIS technique without staging the data.
Control table – Create a control table that lets you check the last load date or ID and retrieve only the rows after that point.
Change Data Capture (CDC) – A mechanism built into SSIS (SQL Server 2012) and SQL Server 2008 Enterprise Edition to detect the rows that have changed when you don't have a LastModifiedDate-type column (a sketch of enabling CDC follows this list).
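For completeness, this is a minimal sketch of enabling CDC on a source table in SQL Server (Enterprise Edition, 2008 or later); the schema and table name dbo.Orders are assumptions.
-- Enable CDC at the database level (run inside the source database).
EXEC sys.sp_cdc_enable_db;
-- Track changes on one table; SQL Server then populates an associated
-- change table from the transaction log.
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',
    @role_name     = NULL;   -- NULL means no gating role is required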
--> Data Warehouse vs Data Mart?
I see a lot of confusion on what exactly is the difference between a data warehouse and a data mart. The best definition that I have heard of a data warehouse is:
“A relational database schema which stores historical data and metadata from an operational system or systems, in such a way as to facilitate the reporting and analysis of the data, aggregated to various levels”.
Or more simply:
“A single organizational repository of enterprise wide data across many or all subject areas”.
Typical data warehouses have these characteristics:
Holds multiple subject areas
Holds very detailed information
Works to integrate all data sources
Does not necessarily use a dimensional model but feeds dimensional models.
On the other hand, a data mart is the access layer of the data warehouse environment that is used to get data out to the users. The data mart is a subset of the data warehouse which is usually oriented to a specific business line or team.
According to the Inmon school of data warehousing, a dependent data mart is a logical subset (view) or a physical subset (extract) of a larger data warehouse, usually isolated for the need to have a special data model or schema (e.g., to restructure for OLAP). One of the benefits of the new Tabular mode in SSAS is that you can build that on top of a data warehouse instead of a data mart, saving time by not having to build a data mart.
So in short, I like to think of a data warehouse as containing many subject areas, and a data mart as containing just one of those subject areas.
More info:
Data Mart vs Data Warehouse – The Great Debate
Data Warehouse Architecture – Kimball and Inmon methodologies
Data Mart Does Not Equal Data Warehouse
Data mart or data warehouse?
Data Warehouse – Data Mart
Data Warehouse vs Data Mart
--> Top Five Benefits of a Data Warehouse?
According to The Data Warehouse Institute, a data warehouse is the foundation for a successful BI program. The concept of data warehousing is pretty easy to understand—to create a central location and permanent storage space for the various data sources needed to support a company’s analysis, reporting and other BI functions.
And it’s really important for your business.
But a data warehouse also costs money — big money. The problem is when big money is involved it’s tough to justify spending it on any project, especially when you can’t really quantify the benefits upfront. When it comes to a data warehouse, it’s not easy to know what the benefits are until it’s up and running. According to BI-Insider.com, here are the key benefits of a data warehouse once it’s launched.
A Data Warehouse Delivers Enhanced Business Intelligence
By providing data from various sources, managers and executives will no longer need to make business decisions based on limited data or their gut. In addition, “data warehouses and related BI can be applied directly to business processes including marketing segmentation, inventory management, financial management, and sales.”
A Data Warehouse Saves Time
Since business users can quickly access critical data from a number of sources—all in one place—they can rapidly make informed decisions on key initiatives. They won’t waste precious time retrieving data from multiple sources.
Not only that but the business execs can query the data themselves with little or no support from IT—saving more time and more money. That means the business users won’t have to wait until IT gets around to generating the reports, and those hardworking folks in IT can do what they do best—keep the business running.
A Data Warehouse Enhances Data Quality and Consistency
A data warehouse implementation includes the conversion of data from numerous source systems into a common format. Since the data from the various departments is standardized, each department will produce results that are in line with all the other departments. So you can have more confidence in the accuracy of your data, and accurate data is the basis for strong business decisions.
A Data Warehouse Provides Historical Intelligence
A data warehouse stores large amounts of historical data so you can analyze different time periods and trends in order to make future predictions. Such data typically cannot be stored in a transactional database or used to generate reports from a transactional system.
A Data Warehouse Generates a High ROI
Finally, the piece de resistance — return on investment. Companies that have implemented data warehouses and complementary BI systems have generated more revenue and saved more money than companies that haven’t invested in BI systems and data warehouses.
And that should be reason enough for senior management to jump on the data warehouse bandwagon.
--> 10 Essential Rules of Dimensional Modeling?
Follow the rules to ensure granular data, flexibility and a future-proofed information resource. Break the rules and you'll confuse users and run into data warehousing brick walls.
Rule #1: Load detailed atomic data into dimensional structures.
Dimensional models should be populated with bedrock atomic details to support the unpredictable filtering and grouping required by business user queries. Users typically don't need to see a single record at a time, but you can't predict the somewhat arbitrary ways they'll want to screen and roll up the details. If only summarized data is available, then you've already made assumptions about data usage patterns that will cause users to run into a brick wall when they want to dig deeper into the details. Of course, atomic details can be complemented by summary dimensional models that provide performance advantages for common queries of aggregated data, but business users cannot live on summary data alone; they need the gory details to answer their ever-changing questions.
Rule #2: Structure dimensional models around business processes.
Business processes are the activities performed by your organization; they represent measurement events, like taking an order or billing a customer. Business processes typically capture or generate unique performance metrics associated with each event. These metrics translate into facts, with each business process represented by a single atomic fact table. In addition to single process fact tables, consolidated fact tables are sometimes created that combine metrics from multiple processes into one fact table at a common level of detail. Again, consolidated fact tables are a complement to the detailed single-process fact tables, not a substitute for them.
Rule #3: Ensure that every fact table has an associated date dimension table.
The measurement events described in Rule #2 always have a date stamp of some variety associated with them, whether it's a monthly balance snapshot or a monetary transfer captured to the hundredth of a second. Every fact table should have at least one foreign key to an associated date dimension table, whose grain is a single day, with calendar attributes and nonstandard characteristics about the measurement event date, such as the fiscal month and corporate holiday indicator. Sometimes multiple date foreign keys are represented in a fact table.
Rule #4: Ensure that all facts in a single fact table are at the same grain or level of detail.
There are three fundamental grains to categorize all fact tables: transactional, periodic snapshot, or accumulating snapshot. Regardless of its grain type, every measurement within a fact table must be at the exact same level of detail. When you mix facts representing multiple levels of granularity in the same fact table, you are setting yourself up for business user confusion and making the BI applications vulnerable to overstated or otherwise erroneous results.
Rule #5: Resolve many-to-many relationships in fact tables.
Since a fact table stores the results of a business process event, there's inherently a many-to-many (M:M) relationship between its foreign keys, such as multiple products being sold in multiple stores on multiple days. These foreign key fields should never be null. Sometimes dimensions can take on multiple values for a single measurement event, such as the multiple diagnoses associated with a health care encounter or multiple customers with a bank account. In these cases, it's unreasonable to resolve the many-valued dimensions directly in the fact table, as this would violate the natural grain of the measurement event. Thus, we use a many-to-many, dual-keyed bridge table in conjunction with the fact table.
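As a concrete illustration of the bridge-table pattern in Rule #5, here is a minimal sketch in generic SQL; fact_encounter, bridge_diagnosis_group and dim_diagnosis are hypothetical names.
-- The fact row carries a single group key instead of many diagnosis keys.
CREATE TABLE fact_encounter (
    date_key            INTEGER NOT NULL,
    patient_key         INTEGER NOT NULL,
    diagnosis_group_key INTEGER NOT NULL,   -- points to the bridge
    charge_amount       DECIMAL(12,2)
);
-- The bridge resolves the many-to-many relationship: one group key maps
-- to many diagnosis rows, optionally with a weighting factor.
CREATE TABLE bridge_diagnosis_group (
    diagnosis_group_key INTEGER NOT NULL,
    diagnosis_key       INTEGER NOT NULL,
    weighting_factor    DECIMAL(5,4)
);
CREATE TABLE dim_diagnosis (
    diagnosis_key       INTEGER NOT NULL,
    diagnosis_code      VARCHAR(20),
    diagnosis_desc      VARCHAR(200)
);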
Rule #6: Resolve many-to-one relationships in dimension tables.
Hierarchical, fixed-depth many-to-one (M:1) relationships between attributes are typically denormalized or collapsed into a flattened dimension table. If you've spent most of your career designing entity-relationship models for transaction processing systems, you'll need to resist your instinctive tendency to normalize or snowflake a M:1 relationship into smaller subdimensions; dimension denormalization is the name of the game in dimensional modeling.
It is relatively common to have multiple M:1 relationships represented in a single dimension table. One-to-one relationships, like a unique product description associated with a product code, are also handled in a dimension table. Occasionally many-to-one relationships are resolved in the fact table, such as the case when the detailed dimension table has millions of rows and its roll-up attributes are frequently changing. However, using the fact table to resolve M:1 relationships should be done sparingly.
Rule #7: Store report labels and filter domain values in dimension tables.
The codes and, more importantly, associated decodes and descriptors used for labeling and query filtering should be captured in dimension tables. Avoid storing cryptic code fields or bulky descriptive fields in the fact table itself; likewise, don't just store the code in the dimension table and assume that users don't need descriptive decodes or that they'll be handled in the BI application. If it's a row/column label or pull-down menu filter, then it should be handled as a dimension attribute.
Though we stated in Rule #5 that fact table foreign keys should never be null, it's also advisable to avoid nulls in the dimension tables' attribute fields by replacing the null value with "NA" (not applicable) or another default value, determined by the data steward, to reduce user confusion if possible.
Rule #8: Make certain that dimension tables use a surrogate key.
Meaningless, sequentially assigned surrogate keys (except for the date dimension, where chronologically assigned and even more meaningful keys are acceptable) deliver a number of operational benefits, including smaller keys which mean smaller fact tables, smaller indexes, and improved performance. Surrogate keys are absolutely required if you're tracking dimension attribute changes with a new dimension record for each profile change. Even if your business users don't initially visualize the value of tracking attribute changes, using surrogates will make a downstream policy change less onerous. The surrogates also allow you to map multiple operational keys to a common profile, plus buffer you from unexpected operational activities, like the recycling of an obsolete product number or acquisition of another company with its own coding schemes.
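A minimal sketch of a surrogate-keyed dimension that keeps the natural (operational) key alongside the surrogate, in generic SQL; the names and the identity mechanism are assumptions and the exact syntax varies by platform.
CREATE TABLE dim_product (
    product_key    INTEGER GENERATED ALWAYS AS IDENTITY,  -- meaningless surrogate key
    product_code   VARCHAR(20) NOT NULL,                   -- natural key from the source system
    product_desc   VARCHAR(200),
    category_name  VARCHAR(100),
    row_eff_date   DATE NOT NULL,                          -- supports type-2 change tracking
    row_end_date   DATE,
    current_flag   CHAR(1) DEFAULT 'Y'
);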
Rule #9: Create conformed dimensions to integrate data across the enterprise.
Conformed dimensions (otherwise known as common, master, standard or reference dimensions) are essential for enterprise data warehousing. Managed once in the ETL system and then reused across multiple fact tables, conformed dimensions deliver consistent descriptive attributes across dimensional models and support the ability to drill across and integrate data from multiple business processes. The Enterprise Data Warehouse Bus Matrix is the key architecture blueprint for representing the organization's core business processes and associated dimensionality. Reusing conformed dimensions ultimately shortens the time-to-market by eliminating redundant design and development efforts; however, conformed dimensions require a commitment and investment in data stewardship and governance, even if you don't need everyone to agree on every dimension attribute to leverage conformity.
Rule #10: Continuously balance requirements and realities to deliver a DW/BI solution that's accepted by business users and that supports their decision-making.
Dimensional modelers must constantly straddle business user requirements along with the underlying realities of the associated source data to deliver a design that can be implemented and that, more importantly, stands a reasonable chance of business adoption. The requirements-versus-realities balancing act is a fact of life for DW/BI practitioners, whether you're focused on the dimensional model, project strategy, technical/ETL/BI architectures or deployment/maintenance plan.
Rule #1: Load detailed atomic data into dimensional structures.
Dimensional models should be populated with bedrock atomic details to support the unpredictable filtering and grouping required by business user queries. Users typically don't need to see a single record at a time, but you can't predict the somewhat arbitrary ways they'll want to screen and roll up the details. If only summarized data is available, then you've already made assumptions about data usage patterns that will cause users to run into a brick wall when they want to dig deeper into the details. Of course, atomic details can be complemented by summary dimensional models that provide performance advantages for common queries of aggregated data, but business users cannot live on summary data alone; they need the gory details to answer their ever-changing questions.
Rule #2: Structure dimensional models around business processes.
Business processes are the activities performed by your organization; they represent measurement events, like taking an order or billing a customer. Business processes typically capture or generate unique performance metrics associated with each event. These metrics translate into facts, with each business process represented by a single atomic fact table. In addition to single process fact tables, consolidated fact tables are sometimes created that combine metrics from multiple processes into one fact table at a common level of detail. Again, consolidated fact tables are a complement to the detailed single-process fact tables, not a substitute for them.
Rule #3: Ensure that every fact table has an associated date dimension table.
The measurement events described in Rule #2 always have a date stamp of some variety associated with them, whether it's a monthly balance snapshot or a monetary transfer captured to the hundredth of a second. Every fact table should have at least one foreign key to an associated date dimension table, whose grain is a single day, with calendar attributes and nonstandard characteristics about the measurement event date, such as the fiscal month and corporate holiday indicator. Sometimes multiple date foreign keys are represented in a fact table.
Rule #4: Ensure that all facts in a single fact table are at the same grain or level of detail.
There are three fundamental grains to categorize all fact tables: transactional, periodic snapshot, or accumulating snapshot. Regardless of its grain type, every measurement within a fact table must be at the exact same level of detail. When you mix facts representing multiple levels of granularity in the same fact table, you are setting yourself up for business user confusion and making the BI applications vulnerable to overstated or otherwise erroneous results.
Rule #5: Resove many-to-many relationships in fact tables.
Since a fact table stores the results of a business process event, there's inherently a many-to-many (M:M) relationship between its foreign keys, such as multiple products being sold in multiple stores on multiple days. These foreign key fields should never be null. Sometimes dimensions can take on multiple values for a single measurement event, such as the multiple diagnoses associated with a health care encounter or multiple customers with a bank account. In these cases, it's unreasonable to resolve the many-valued dimensions directly in the fact table, as this would violate the natural grain of the measurement event. Thus, we use a many-to-many, dual-keyed bridge table in conjunction with the fact table.
Rule #6: Resolve many-to-one relationships in dimension tables.
Hierarchical, fixed-depth many-to-one (M:1) relationships between attributes are typically denormalized or collapsed into a flattened dimension table. If you've spent most of your career designing entity-relationship models for transaction processing systems, you'll need to resist your instinctive tendency to normalize or snowflake a M:1 relationship into smaller subdimensions; dimension denormalization is the name of the game in dimensional modeling.
It is relatively common to have multiple M:1 relationships represented in a single dimension table. One-to-one relationships, like a unique product description associated with a product code, are also handled in a dimension table. Occasionally many-to-one relationships are resolved in the fact table, such as the case when the detailed dimension table has millions of rows and its roll-up attributes are frequently changing. However, using the fact table to resolve M:1 relationships should be done sparingly.
Rule #7: Store report labels and filter domain values in dimension tables.
The codes and, more importantly, associated decodes and descriptors used for labeling and query filtering should be captured in dimension tables. Avoid storing cryptic code fields or bulky descriptive fields in the fact table itself; likewise, don't just store the code in the dimension table and assume that users don't need descriptive decodes or that they'll be handled in the BI application. If it's a row/column label or pull-down menu filter, then it should be handled as a dimension attribute.
Though we stated in Rule #5 that fact table foreign keys should never be null, it's also advisable to avoid nulls in the dimension tables' attribute fields by replacing the null value with "NA" (not applicable) or another default value, determined by the data steward, to reduce user confusion if possible.
Rule #8: Make certain that dimension tables use a surrogate key.
Meaningless, sequentially assigned surrogate keys (except for the date dimension, where chronologically assigned and even more meaningful keys are acceptable) deliver a number of operational benefits, including smaller keys which mean smaller fact tables, smaller indexes, and improved performance. Surrogate keys are absolutely required if you're tracking dimension attribute changes with a new dimension record for each profile change. Even if your business users don't initially visualize the value of tracking attribute changes, using surrogates will make a downstream policy change less onerous. The surrogates also allow you to map multiple operational keys to a common profile, plus buffer you from unexpected operational activities, like the recycling of an obsolete product number or acquisition of another company with its own coding schemes.
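A hedged sketch of a dimension keyed by a meaningless surrogate; in Teradata, an identity column is one way (among others) to generate it, and the hypothetical effective/end dates support type-2 change tracking:
CREATE TABLE Product_Dim
(
 Product_Key INTEGER GENERATED ALWAYS AS IDENTITY
             (START WITH 1 INCREMENT BY 1) NOT NULL,   -- surrogate key
 Product_Code VARCHAR(20) NOT NULL,                    -- operational (natural) key
 Product_Desc VARCHAR(100),
 Row_Eff_Date DATE,
 Row_End_Date DATE
)
UNIQUE PRIMARY INDEX (Product_Key);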
Rule #9: Create conformed dimensions to integrate data across the enterprise.
Conformed dimensions (otherwise known as common, master, standard or reference dimensions) are essential for enterprise data warehousing. Managed once in the ETL system and then reused across multiple fact tables, conformed dimensions deliver consistent descriptive attributes across dimensional models and support the ability to drill across and integrate data from multiple business processes. The Enterprise Data Warehouse Bus Matrix is the key architecture blueprint for representing the organization's core business processes and associated dimensionality. Reusing conformed dimensions ultimately shortens the time-to-market by eliminating redundant design and development efforts; however, conformed dimensions require a commitment and investment in data stewardship and governance, even if you don't need everyone to agree on every dimension attribute to leverage conformity.
Rule #10: Continuously balance requirements and realities to deliver a DW/BI solution that's accepted by business users and that supports their decision-making.
Dimensional modelers must constantly straddle business user requirements along with the underlying realities of the associated source data to deliver a design that can be implemented and that, more importantly, stands a reasonable chance of business adoption. The requirements-versus-realities balancing act is a fact of life for DW/BI practitioners, whether you're focused on the dimensional model, project strategy, technical/ETL/BI architectures or deployment/maintenance plan.
--> Query to check what access has been given on the databases by user name ?
Step 1:
CREATE VOLATILE SET TABLE AccessRightsAbbv
(
Description VARCHAR(50),
AccessRight CHAR(2)
)PRIMARY INDEX(AccessRight) ON COMMIT PRESERVE ROWS;
ins AccessRightsAbbv('CHECKPOINT','CP');
ins AccessRightsAbbv('CREATE AUTHORIZATION','CA');
ins AccessRightsAbbv('CREATE MACRO','CM');
ins AccessRightsAbbv('CREATE PROCEDURE','PC');
ins AccessRightsAbbv('CREATE TABLE','CT');
ins AccessRightsAbbv('CREATE TRIGGER','CG');
ins AccessRightsAbbv('CREATE VIEW','CV');
ins AccessRightsAbbv('DELETE','D');
ins AccessRightsAbbv('DROP AUTHORIZATION','DA');
ins AccessRightsAbbv('DROP FUNCTION ','DF');
ins AccessRightsAbbv('DROP MACRO','DM');
ins AccessRightsAbbv('DROP PROCEDURE','PD');
ins AccessRightsAbbv('DROP TRIGGER','DG');
ins AccessRightsAbbv('DROP VIEW','DV');
ins AccessRightsAbbv('DUMP','DP');
ins AccessRightsAbbv('INSERT','I');
ins AccessRightsAbbv('RESTORE','RS');
ins AccessRightsAbbv('SELECT','R');
ins AccessRightsAbbv('UPDATE','U');
ins AccessRightsAbbv('CREATE DATABASE','CD');
ins AccessRightsAbbv('CREATE USER','CU');
ins AccessRightsAbbv('DROP DATABASE','DD');
ins AccessRightsAbbv('DROP USER','DU');
ins AccessRightsAbbv('ALTER EXTERNAL PROCEDURE','AE');
ins AccessRightsAbbv('ALTER FUNCTION','AF');
ins AccessRightsAbbv('ALTER PROCEDURE','AP');
ins AccessRightsAbbv('CREATE EXTERNAL PROCEDURE','CE');
ins AccessRightsAbbv('CREATE FUNCTION','CF');
ins AccessRightsAbbv('EXECUTE FUNCTION','EF');
ins AccessRightsAbbv('EXECUTE PROCEDURE','PE');
ins AccessRightsAbbv('EXECUTE','E');
ins AccessRightsAbbv('DROP TABLE','DT');
ins AccessRightsAbbv('INDEX','IX');
ins AccessRightsAbbv('REFERENCES','RF');
ins AccessRightsAbbv('ABORT SESSION','AS');
ins AccessRightsAbbv('MONRESOURCE','MR');
ins AccessRightsAbbv('MONSESSION','MS');
ins AccessRightsAbbv('SETRESRATE','SR');
ins AccessRightsAbbv('SETSESSRATE','SS');
ins AccessRightsAbbv('CREATE PROFILE','CO');
ins AccessRightsAbbv('CREATE ROLE','CR');
ins AccessRightsAbbv('DROP PROFILE','DO');
ins AccessRightsAbbv('DROP ROLE','DR');
ins AccessRightsAbbv('REPLCONTROL','RO');
Step 2:
SELECT
A.ROLENAME
,A.GRANTEE
--,A.GRANTOR
--,A.DefaultRole,A.WithAdmin
,B.DATABASENAME
--,B.TABLENAME,B.COLUMNNAME,B.GRANTORNAME
,B.AccessRight
,CASE WHEN C.AccessRight IS NOT NULL THEN C.Description
ELSE B.AccessRight END AS AccessRightDesc
FROM DBC.ROLEMEMBERS A
INNER JOIN DBC.ALLROLERIGHTS B
ON A.ROLENAME = B.ROLENAME
LEFT OUTER JOIN AccessRightsAbbv C
ON B.AccessRight = C.AccessRight
WHERE Grantee in ('User Name')
AND B.DATABASENAME IN
(
'List of DB Names'
)
GROUP BY 1,2,3,4,5--,6,7,8,9,10,11
ORDER BY 2,1,3,4,5;
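The query above reports rights that come through roles. Directly granted user-level rights live in a different dictionary view; a hedged companion query, assuming the usual UserName/DatabaseName/AccessRight columns of DBC.ALLRIGHTS, could look like this:
SELECT
A.UserName
,A.DatabaseName
,A.AccessRight
,COALESCE(C.Description, A.AccessRight) AS AccessRightDesc
FROM DBC.ALLRIGHTS A
LEFT OUTER JOIN AccessRightsAbbv C
ON A.AccessRight = C.AccessRight
WHERE A.UserName IN ('User Name')
AND A.DatabaseName IN ('List of DB Names')
GROUP BY 1,2,3,4
ORDER BY 1,2,3;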
--> Easy way to get the syntax for a command in Teradata ?
Easy way to get the syntax for a command in Teradata
exec dbc.getsyntax('sql','create table');
GetSyntax is a macro that accepts two parameters.
The first parameter is the type of utility the command belongs to.
Valid values include the following:
PMPC
BULKLOAD
DUMP
ARCHIVE
MULTILOAD
VERSION
SPL
FASTEXPORT
DATA_TYPE
HELP
SQL
FASTLOAD
TPCCONS
The second parameter is the command name for which we need the syntax.
Examples include:
exec dbc.getsyntax('FASTLOAD','DEFINE');
exec dbc.getsyntax('MULTILOAD','.IMPORT');
exec dbc.getsyntax('SQL','UPDATE');
REPLACE MACRO DBC.GetSyntax
(
TheFunction (CHAR(30)),
Command (CHAR(30))
)
AS
(
SELECT Syntax FROM SysAdmin.HelpSyntax
WHERE TheFunction = :TheFunction AND
Command = :Command
ORDER BY LineNumber;
);
--> Teradata Errorcodes and Description?
Teradata DBC Errorcodes and Description
ErrorCode ErrorText
0 **** ErrMsgs file as of 07:11:26 04/05/2004.
2507 Out of spool space on disk.
2566 Processor nnnn failed to respond
2585 An I/O error on an MLOAD table caused a crash.
2587 Segment overflow -- FastExport request is too complex to process.
2589 Request block is not found.
2606 Bad argument involving %TVMID.%FLDID for LOG function.
2608 Bad argument involving %TVMID.%FLDID for LN function
2610 Internal error: field maintenance error
2627 Internal error: invalid operand for UpCaseOp
2629 Internal error: bad data type for operation.
2648 Internal error: could not find spool header.
2650 Numeric processor operand error.
2669 Outrow reference to source is illegal case (in MoveRow).
2671 Unable to find a source constant (in MoveRow).
2688 Numeric Processor Operand Error.
2690 Zero table length in Unload.
2709 SSSWAIT has timed out.
2711 Inconsistent AMP/PE Vproc count.
2728 Aborted due to a non-valid reference index on %DBID.%TVMID.
2730 Warning: RI is not valid on one or more AMPs.
2732 Warning: Skipping parent key validation in level 3 checking.
2749 Primary and fallback USI counts do not match
2751 Data row out of order.
2770 Data row id referenced by NUSI is on wrong AMP.
2772 NUSI row indexes non existent data row.
2791 Primary and fallback data row checksums do not match.
2793 MLoad MARK DUPLICATE INSERT (dup row)
2810 Internal error: permanent journal error found.
2812 Internal error: invalid record found on transient journal.
2814 Data size exceeds the maximum specified.
2831 No row was found for the event in the Recovery Control Catalog.
2833 Row already exists with different content.
2852 Inconsistent check point records found on Restored journal.
2854 Specified checkpoint row does not exist in Journal table %TVMID.
2871 Missing table header for DBC.DBCInfoTbl, updates bypassed.
2873 Pending operation on DBC.DBCInfoTbl prevented updates.
2875 Task Lock cannot be granted - resource busy
2892 Null value eliminated in set function
2894 The cursor spool row specified is invalid.
2913 A specific-AMP recovery was executed out of order.
2915 The journal for the updated table does not exist
2934 Input is in new row format -- expected old format.
2936 No row was found for Bulk Load delete/update.
2955 Invalid before image on transient journal.
2974 Invalid backup message received.
2976 Message with invalid kind has been received.
2995 Duplicate field in into specification.
2997 Requested field does not exist in row.
3016 Userid, password or account > 30 characters.
3018 RAR response is malformed.
3035 DBC.Sessiontbl is corrupted, session cannot be recovered.
3117 The continue request contains invalid parcels.
3119 Continue request submitted but no response to return.
3136 Internal Error: The request cannot be aborted, session is in-doubt.
3138 Waiting to load in-doubt session information
3140 Request has already been committed
3157 DQM Cache invalid or size exceeded
3159 DQM internal error
3180 Service console programming fault detected.
3220 RCT program fault.
3239 Invalid RepBegin message received.
3241 MDS database does not exist.
3260 The LOGOFF SESSIONS option must have a value of Y,y,N,n,BLANK or NULL.
3262 The LOCAL CHANGE option must have a value of Y,y,N,n,BLANK or NULL.
3279 There are no more Monitor Sessions available.
3281 Internal Error: Inconsistent Monitor software condition detected.
3283 Internal Error: No free context blocks are available.
3403 Request parcel contains fewer session ids than specified.
3405 Session %VSTR not found or does not have an in-doubt transaction.
3504 Selected non-aggregate values must be part of the associated group.
3506 Internal error: Tree segment contains errors.
3525 The user cannot create or drop a PRIMARY index.
3527 Format string '%VSTR' has combination of numeric, character and GRAPHIC values.
3544 Partial string matching requires character operands.
3546 Default values may not be parameterized.
3548 The user cannot run FastLoad and MLoad at the same time.
3565 Database has tables without After Image Journaling.
3567 Deleting database 'DBC' is not allowed.
3586 Processor specification error(too many or invalid values).
3588 A User may not be assigned a NULL password.
3607 Too many values in a subquery.
3609 Excessively complex set conditional.
3626 Internal error: WITH BY clause contains multiple tables.
3628 Only identical Aggregated Expressions can use the DISTINCT option.
3647 NULL may not be used as the argument for an aggregate function.
3649 Compressed NULL and NOT NULL cannot coexist for column %VSTR.
3668 A table listed in the Dictionary is not present.
3670 '%VSTR' is not a User.
3687 Special characters may not be used in passwords.
3689 END EXPORT statement is valid only if in FastExport Task.
3708 Syntax error, %VSTR should be deleted.
3710 Insufficient memory to parse this request, during %VSTR phase.
3729 Invalid use of non-SQL field definition(s) in ALTER TABLE.
3731 The user must use IS NULL or IS NOT NULL to test for NULL values.
3748 Parcel kind or ordering is invalid.
3750 Illegal Kanji object name %VSTR.
3752 Too many digits in exponent.
3769 Checksum locking modifier must reference the same table as the SELECT statement that follows it.
3771 Illegal expression in WHEN clause of CASE expression.
3790 GRANT/REVOKE LOGON not allowed TO DBC.
3792 Conflicting CASESPECIFIC options.
3809 Column '%VSTR' is ambiguous.
3811 Column '%VSTR' is NOT NULL. Give the column a value.
3813 The positional assignment list has too many values.
3832 A 2PC parcel must not directly follow an Options parcel.
3851 Macro is greater than %VSTR after name expansion.
3853 '%VSTR' is not a table.
3872 Illegal use of an aggregate operation in an UPDATE statement.
3874 No journaling options have been specified.
3891 Check text for check constraint (explicit or derived from partitioning expression) is greater than 8192 characters.
3912 Invalid message kind received.
3914 ABORT request transaction ids do not match.
3933 The Maximum Possible Row Length in the Table is too Large.
3935 Session and/or request number does not match an aborted task.
3954 Non-GRAPHIC data cannot be assigned to column %VSTR.
3956 Improper representation of hexadecimal GRAPHIC data.
3973 Cannot specify multiple constraints in an Alter Table statement.
3975 Table '%VSTR' is an illegal referential constraint target.
3994 Alias name defined should be used instead of table name%VSTR.
3996 Right truncation of string data.
5014 A CI with no DBDs has been found.
5016 The table header passed in memory is invalid.
5033 An invalid first sector number has been detected.
5035 An unexpected FSG error has occurred.
5054 An invalid SRD reference array index was detected.
5056 An unfree (uninitialized) cylinder has been allocated.
5073 The DB returned by IdxUpgradeDB is not the original one.
5075 An invalid DBKOPERATION was specified.
5077 An invalid deltasectors was specified.
5096 Sort attempted to generate an invalid row uniq value.
5115 An attempt was made to generate an invalid row uniq.
5117 dbkmoddbs could not perform the block split indicated by dbkfit.
5136 The CI was unlocked to avoid an undetectable deadlock.
5138 An unexpected error has occurred in a UNIX service routine.
5157 TID: %d %d %d First RID: %d %d %d %d %d Last Partition/Hash: %d %d %d\n
5159 System test forced tpareset\n
5176 The data block checksum is invalid.
5178 The file information block checksum is ' 'invalid.
5300 Foreign Key column '%VSTR' cannot reference itself.
5319 The specified DATABLOCKSIZE value is invalid
5321 Constraint, primary index, or secondary index with the same name '%VSTR' already exists in table.
5338 More than 1,000 volatile tables in current session.
5340 Database name, if specified, must be the login user name for a volatile table.
5342 Table option not allowed for temporary/volatile table.
5361 The compress value must be in the repertoire of the character column.
5363 The default value must be in the repertoire of the character column.
5380 Internal error: Unexpected character Data type.
5382 The word APPLY is not spelled correctly in the Release Mload statement.
5401 Number of fractional digits out of range.
5403 WITH TIME ZONE must follow either TIME or TIMESTAMP.
5420 No triggers exist for table '%VSTR'.
5422 A triggered action statement cannot update the subject table.
5424 A triggered action statement is the wrong kind.
5441 Replicated table status cannot be INITIATED for Define Role statement.
5443 The specified role of a replicated table is either missing or invalid.
5462 No DDL statements will be allowed until MDS is ready to accept them.
5464 Error in Join Index DDL, %VSTR.
5483 Illegal use of Ordered Analytical Function operation in an UPDATE statement.
5485 MLinReg can have only one entry in the order specification list.
5502 The name specified as an argument does not match the corresponding parameter name '%VSTR'.
5523 Invalid MULTITSR parcel received.
5525 LOCKING modifier canot be specified with CALL SQL statement.
5544 The OPTCOST GDO is uninitialized.
5546 Sampleid cannot be nested inside Ordered Analytical and AGGREGATE Functions.
5563 The table specified in the UPSERT statement contains triggers that could be fired.
5565 The UPDATE specified in the UPSERT statement is a complex update.
5584 Data exception: numeric value out of range.
5586 Random not allowed in inserted PI column value.
5605 A NEW_TABLE or OLD_TABLE reference is not allowed to be passed to a macro.
5607 Check output for possible warnings encountered in compiling UDF.
5624 The Branch offset is very long.
5628 Column %FSTR not found in %VSTR.
5645 The value of '%VSTR' supplied is out of range.
5647 Internal erorr: Data dictionary table UDFInfo in inconsistent state.
5666 LOBs are not allowed in indexes.
5687 REFERENTIAL INTEGRITY is not allowed on LOB columns.
5689 LOBs are not allowed to be compared.
5706 Cannot BEGIN QUERY LOGGING for user because ALL rule exists.
5708 Table header size exceeds limit.
5727 Missing the last range end.
5729 Primary index for the table is not partitioned.
5769 The Locator is invalid due to an internal error.
5788 %VSTR currently do not support Identity Columns.
5790 UPSERT on a table with an Identity Column primary index is illegal.
5809 SET keyword in ANSI UPDATE is required.
5811 MINUS of queries is not in Entry Level ANSI.
5828 PRIVILEGES keyword is required in ANSI.
5830 ROLLBACK WORK cannot indicate a condition or return text to user in ANSI.
5832 LOCKING modifier is not ANSI.
5849 INSTEAD OF triggers are not supported.
5851 BEFORE triggers having data changing statements are not supported.
6013 Vproc Number is not defined in the Physical Configuration.
6015 The host number is invalid.
6032 Warning: The number of Amps in a cluster is less than one-half of the Default Cluster value specified.
6034 Incompatible operation with the Vproc type in current config map.
6036 The range is invalid.
6053 Unable to rebuild reference index, needs to be rebuilt.
6114 Message Kind is invalid in the message header.
6135 Reconfig is aborted at a reversible state.
6137 The hash bucket arrays on an AMP are invalid.
6156 Encountered disk error, causing too many Amps down in cluster
6158 Recovery is still active, Reconfig will not run until recovery done
6604 A changed data row was received that whose contents are in error
6606 Not enough space was provided to receive the data into
6665 RSG failed to locate the TD AMP Worker Task's well-known mailbox
6667 Contents of RSG configuration file are incorrect
6705 An illegally formed character string was encountered during translation.
6707 Internal error: A move of overlapping strings was requested of the ISF subsystem.
6766 Syntax error in SDF.
6768 Invalid numeric format.
6770 Invalid CCD value.
7420 An invalid hash bucket has been encountered.
7439 Row references non-existent session.
7441 The date does not correspond to an exiting era.
7443 The length of argument is incorrect.
7460 Cannot continue utility operation on %DBID.%TVMID because system configuration has changed.
7462 Operation not allowed: pending replication copy operation on table %DBID.%TVMID.
7481 Operation with same utility Id and table Id already running.
7483 Warning: RI checks bypassed for Referential Constraints.
7502 A system trap was caused by UDF %DBID.%TVMID for %VSTR
7504 in UDF %DBID.%TVMID: SQLSTATE %VSTR
7521 Primary LOB row is on wrong AMP.
7523 Fallback LOB row is in wrong subtable.
7544 Cannot logon utility session.
7563 Source row is deleted while update is in progress.
7565 One or more rows found for table not in DBC.TVM. Deleted.
7624 %FSTR: Invalid message class (%VSTR) encountered.
7626 %FSTR: Invalid (%VSTR) step.
7643 %FSTR: Error Initiating a new task; tskget failed with return code (%VSTR).
7645 %FSTR: Error writing into file; fwrite failed with return code (%VSTR).
7704 The RepPhase1 message indicates fewer RepDDL than already received.
7706 Multiple RepBegTrans received for the same transaction before RepPhase1.
7708 Transaction previously asynchronously aborted. Now transaction aborted with RepPhase1Rsp message.
7725 MDS received more DDL chunks than expected for a transaction.
8011 The LAN message Format field is invalid.
8013 The LAN message MessageLength field is invalid.
8051 No network protocol exist.
8112 Secondary index is not unique.
8114 Update is illegal.
8215 ROLLBACK recovery stopped.
8217 Maximum number of tables exceeded.
8234 Duplicate rows will not be restored to table
8236 Cross Platform Migration is currently not supported.
8257 Improperly formatted directed request.
8501 LOB end of data encountered.
8503 Invalid LOB ID
8522 LOT could not allocate a PDE mailbox
8541 LOT send was passed conflicting values in the descriptor
8543 LOT got an unexpected return code discarding a scratch segment
--> COLLECT STATS, STATISTICS COLLECTING
COLLECT STATS
To check which statistics the optimizer recommends for a query, run the query's EXPLAIN after issuing the command below.
DIAGNOSTIC HELPSTATS ON FOR SESSION;
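A hedged illustration with a hypothetical Order_Fact table (the recommendation text shown in the comment is indicative, not verbatim):
DIAGNOSTIC HELPSTATS ON FOR SESSION;
EXPLAIN
SELECT store_id, SUM(sales_amount)
FROM Order_Fact
WHERE order_date >= DATE '2013-01-01'
GROUP BY 1;
-- With the diagnostic on, the EXPLAIN output ends with recommended statistics,
-- e.g. COLLECT STATISTICS Order_Fact COLUMN (ORDER_DATE), which can be copied
-- and run as real COLLECT STATISTICS statements.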
Statistical information is vital for the optimizer when query plans are built. But collecting statistics can involve time and resources. By understanding and combining several different statistics gathering techniques, users of Teradata can find the correct balance between good query plans and the time required to ensure adequate statistical information is always available.
Collect Full Statistics
Non-indexed columns used in predicates
All NUSIs with an uneven distribution of values *
NUSIs used in join steps
USIs/UPIs if used in non-equality predicates (range constraints)
Most NUPIs (see below for a fuller discussion of NUPI statistic collection)
Full statistics always need to be collected on relevant columns and indexes on small tables (less than 100 rows per AMP)
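A minimal sketch of full-statistics collection for the cases listed above (table and column names are hypothetical):
COLLECT STATISTICS ON Order_Fact COLUMN (order_status);    -- non-indexed column used in predicates
COLLECT STATISTICS ON Order_Fact INDEX (customer_number);  -- NUSI used in join steps
COLLECT STATISTICS ON Region_Dim COLUMN (region_code);     -- small table: collect full statistics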
Can Rely on Random AMP Sampling
USIs or UPIs if only used with equality predicates
NUSIs with an even distribution of values
NUPIs that display even distribution, and if used for joining, conform to assumed uniqueness (see Point #2 under “Other Considerations” below)
See “Other Considerations” for additional points related to random AMP sampling
Option to use USING SAMPLE
Unique index columns
Nearly-unique columns or indexes**
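A hedged example of sampled collection on a nearly-unique column (hypothetical names):
COLLECT STATISTICS USING SAMPLE ON Order_Fact COLUMN (order_id);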
Collect Multicolumn Statistics
Groups of columns that often appear together in conditions with equality predicates, if the first 16 bytes of the concatenated column values are sufficiently distinct. These statistics will be used for single-table estimates.
Groups of columns used for joins or aggregations, where there is either a dependency or some degree of correlation among them.*** With no multicolumn statistics collected, the optimizer assumes complete independence among the column values. The more that the combination of actual values are correlated, the greater the value of collecting multicolumn statistics will be.
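A hedged sketch of a multicolumn statistic on correlated columns (hypothetical names):
COLLECT STATISTICS ON Customer_Dim COLUMN (customer_state, customer_zip);  -- correlated pair used together in predicates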
Other Considerations
1. Optimizations such as nested join, partial GROUP BY, and dynamic partition elimination will not be chosen unless statistics have been collected on the relevant columns.
2. NUPIs that are used in join steps in the absence of collected statistics are assumed to be 75% unique, and the number of distinct values in the table is derived from that. A NUPI that is far off from being 75% unique (for example, it’s 90% unique, or on the other side, it’s 60% unique or less) will benefit from having statistics collected, including a NUPI composed of multiple columns regardless of the length of the concatenated values. However, if it is close to being 75% unique, then random AMP samples are adequate. To determine what the uniqueness of a NUPI is before collecting statistics, you can issue this SQL statement:
EXPLAIN SELECT DISTINCT nupi-column FROM table;
3. For a partitioned primary index table, it is recommended that you always collect statistics on:
PARTITION. This tells the optimizer how many partitions are empty, and how many rows are in each partition. This statistic is used for optimizer costing.
The partitioning column. This provides cardinality estimates to the optimizer when the partitioning column is part of a query’s selection criteria.
4. For a partitioned primary index table, consider collecting these statistics if the partitioning column is not part of the table’s primary index (PI):
(PARTITION, PI). This statistic is most important when a given PI value may exist in multiple partitions, and can be skipped if a PI value only goes to one partition. It provides the optimizer with the distribution of primary index values across the partitions. It helps in costing the sliding-window and rowkey-based merge join, as well as dynamic partition elimination.
(PARTITION, PI, partitioning column). This statistic provides the combined number of distinct values for the combination of PI and partitioning columns after partition elimination. It is used in rowkey join costing.
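A hedged sketch of the partitioned-table recommendations in points 3 and 4, assuming a hypothetical Order_PPI_Table partitioned on order_date with primary index customer_number:
COLLECT STATISTICS ON Order_PPI_Table COLUMN (PARTITION);                              -- point 3: partition demographics
COLLECT STATISTICS ON Order_PPI_Table COLUMN (order_date);                             -- point 3: partitioning column
COLLECT STATISTICS ON Order_PPI_Table COLUMN (PARTITION, customer_number);             -- point 4
COLLECT STATISTICS ON Order_PPI_Table COLUMN (PARTITION, customer_number, order_date); -- point 4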
5. Random AMP sampling has the option of pulling samples from all AMPs, rather than from a single AMP (the default). All-AMP random AMP sampling has these particular advantages:
It provides a more accurate row count estimate for a table with a NUPI. This benefit becomes important when NUPI statistics have not been collected (as might be the case if the table is extraordinarily large), and the NUPI has an uneven distribution of values.
Statistics extrapolation for any column in a table will not be attempted for small tables or tables whose primary index is skewed (based on full statistics having been collected on the PI), unless all-AMP random AMP sampling is turned on. Because a random AMP sample is compared against the table row count in the histogram as the first step in the extrapolation process, an accurate random AMP sample row count is critical for determining if collected statistics are stale, or not.
* Uneven distribution exists when the High Mode Frequency (ModeFrequency column in interval zero) in the histogram is greater than the average rows-per-value (RPV) by a factor of 4 or more. RPV is calculated as Number of Rows / Number of Uniques.
** Any column which is over 95% unique is considered a nearly-unique column.
*** Correlated columns within a multicolumn statistic are columns where the value in one may influence, or predict the values in the second. For example in a nation table, there is a tight correlation between nationkey and nationname. In a customer table there might be a correlation, but a somewhat looser correlation, between customer zip code and customer income band.
Dependent columns are columns where the value in one column tends to directly influence the value in the second column. An example of a dependent column could be customer zip code, which is dependent on the customer state; if they both appeared in a multicolumn statistic, there would be a dependency between them. Another example with some dependency is job title, which is sometimes dependent on the industry segment, if both were in a multicolumn statistic.
--> Teradata Architecture-PE,BYNET,AMP,VDISK,PDISK
Introduction to Teradata RDBMS
Teradata RDBMS is a complete relational database management system. The system is based on off-the-shelf Symmetric Multiprocessing (SMP) technology combined with a communication network connecting the SMP systems to form a Massively Parallel Processing (MPP) system. BYNET is a hardware inter-processor network that links SMP nodes; all processors in the same SMP node are connected by a virtual BYNET. The components below work together as follows.
PDE (Parallel Database Extensions):
This component is an interface layer on top of the operating system. Its functions include executing vprocs (virtual processors), providing a parallel environment, scheduling sessions, debugging, etc.
Teradata File System:
It allows the Teradata RDBMS to store and retrieve data regardless of the low-level operating system interface.
PE (Parsing Engine):
Communicate with client
Manage sessions
Parse SQL statements
Communicate with AMPs
Return result to the client
AMP (Access Module Processor):
BYNET interface
Manage database
Interface to disk subsystem
CLI (Call Level Interface):
A SQL query is submitted and transferred in CLI packet format
TDP (Teradata Director Program):
Route the packets to the specified Teradata RDBMS server
Teradata RDBMS has the following components that support all data communication
management:
_ Call Level Interface ( CLI )
_ WinCLI & ODBC
_ Teradata Director Program ( TDP for channel attached client )
_ Micro TDP ( TDP for network attached client )
Node hardware and software components
CPUs are not physically associated with vprocs. Performance is best when you use the UNIX affinity scheduler to keep a logical association between a CPU and a vproc.
Memory - Vprocs share a free memory pool within a node. A segment of memory is allocated to a vproc for use, then returned to the memory pool for use by another vproc.
MCA - Slots in the MCA ( Micro Channel Adapter) are used for the following connections:
Local Peripheral Board (LPB)
External disk arrays
LAN connections
Mainframe channel connections
MCCA - MCCA boards (Micro Channel Cable Adapter) enable communication between a channel-attached node and the Tailgate box. MCCA boards are located in MCA slots.
Ethernet Card - Each LAN connection to a node requires an Ethernet card, which communicates with the Teradata Gateway software. Ethernet cards are located in MCA slots.
Twisted Pair Shielded Cable - Connects the MCCA card to the Tailgate box for a mainframe channel connection.
LAN Cable - Connect the Ethernet cards in the MCA to the LAN.
Tailgate Box - An adapter between the node cabinet and the mainframe in a channel-connected system.
Bus and Tag Cables - Connects the Tailgate box to the mainframe.
Virtual Disk (vdisk) - The logical disk managed by an AMP. Each AMP is associated with a single vdisk.
UNIX - The Teradata RDBMS is built on the UNIX operating system for an open environment. NCR added MP-RAS extensions to UNIX to facilitate a multiple CPU environment.
Parallel Database Extensions (PDE) - Software that runs on UNIX MP-RAS. It was created by NCR to support the parallel environment.
Trusted Parallel Application (TPA) - Implements virtual processors and runs on the foundation of UNIX MP-RAS and PDE.
The Teradata RDBMS for UNIX is classified as a TPA.
Access Module Processors (AMP) are vprocs that receive steps from PEs and perform database functions to return or update data. Each AMP is associated with one vdisk.
PE - Vprocs that receive SQL requests from the client and break the requests into steps. The PEs send the steps to the AMPs and subsequently return the answer to the client.
Teradata Gateway - Software that communicates between the PEs and applications running on LAN-attached clients and a node in the system. The Teradata Gateway has a session limit of 600 sessions.
Channel Driver - Software that communicates between the PEs and applications running on channel-attached clients.
Platforms
Single Node System: All of the node components together comprise a node. A single node system is typically implemented on an SMP platform. The vprocs in an SMP system communicate over the vnet.
Nodes working together create a multiple-node Teradata RDBMS system, which is implemented on an MPP platform. The nodes and vprocs communicate over the BYNET (Banyan Network).
BYNET
The BYNET is a high-speed interconnect that is responsible for:
Sending messages
Merging data
Sorting answers
The BYNET messaging capability enables vprocs to send different types of messages:
Point-to-Point - A vproc can send a message to another vproc:
In the same node, only BYNET software is used; the message is reassigned in memory to the target vproc.
In another node, the message uses both BYNET hardware and software.
Multicast - A vproc can send a message to multiple vprocs by sending a broadcast message to all nodes. The BYNET software on the receiving node determines whether a vproc on the node should receive or discard the message.
Broadcast - A vproc can broadcast a message to all the vprocs in the system.
Two BYNETs per system for the following reasons:
Performance
Fault Tolerance
Clique
A clique is a group of nodes that share access to the same disk arrays. The nodes have a daisy-chain connection to each disk array controller.
Cliques provide data accessibility if a node fails for any reason (e.g., a UNIX reset).
Vprocs are distributed across all nodes in the system. Each multi-node system has at least one clique.
Software Components
UNIX operating system - The Teradata RDBMS runs on UNIX SVR4 with MP-RAS.
Parallel Database Extensions (PDE) - PDE was added to the UNIX kernel by NCR to support the parallel software environment.
Trusted Parallel Application (TPA) - A TPA uses PDE to implement virtual processors. The Teradata RDBMS is classified as a TPA.
Channel Driver - The Channel Driver software is the means of communication between the application and the PEs assigned to channel-attached clients.
Teradata Gateway - The Gateway software is the means of communication between the application and the PEs assigned to network-attached clients. There is one Gateway per node.
AMP - The AMP is a type of vproc that has software to manage data.
AMP Worker Task (AWT) Functions in the AMP perform a number of operations, including:
Locking Tables
Executing Tables
Joining Tables
Executing end transaction steps
The file system software accesses the data on the virtual disks. Each AMP uses the file system software to read from and write to the virtual disks.
Console Utilities - The AMP software includes utilities to perform generally sophisticated, low-level functions such as:
Configure and reconfigure the system
Rebuild tables
Reveal details about locks and space status
PE - A PE is a type of vproc that has software components to break SQL into steps and send the steps to the AMPs.
Session Control - When you log on to the Teradata RDBMS through your application, the session control software on the PE establishes that session. Session control also manages and terminates sessions on the PE.
Parser/Optimizer - The parser interprets your Teradata SQL request and checks the syntax. The parser decomposes the request into AMP steps, using the optimizer to determine the most efficient way to access the data on the virtual disks. Then the parser sends the steps to the dispatcher.
Dispatcher - The dispatcher is responsible for a number of tasks, depending on the operation it is performing:
Processing Requests
Processing Responses
--> SQL STATEMENT EXECUTION IN TERADATA
FLOW OF SQL STATEMENT
A user generates an SQL query on the channel attached client. The query can either be from a BTEQ session at an interactive terminal, from a compatible fourth generation language, or can originate from within an application program coded in a host language.
The CLI request handler packages the request and sends it to the Teradata Director Program (TDP) for routing to the server.
The TDP establishes a session, then routes the request across the communications channel to the parsing engine (PE).
The parser component of the PE opens the request package and parses the SQL code for processing, interprets it, checks its syntax, and optimizes the access plan.
Without errors - The parser decodes the request into a series of work steps and passes them to the dispatcher.
With errors - The dispatcher receives the appropriate error message and returns it to the requester. Processing terminates.
The dispatcher sequences the steps and passes them on to the BYNET with instructions about whether the steps are for one Access Module Process (AMP), an AMP group, or for all AMPs.
The BYNET (or virtual BYNET on a single node system) distributes the execution steps to the appropriate AMP for processing.
The AMPs process the execution steps by performing operations on the database. The AMPs make these operations by making calls to the file system.
The file system performs primitive physical data block operations by locating the data blocks to be manipulated and then passing control to the disk subsystem.
The disk subsystem retrieves the requested blocks for the file system.
The disk manager returns the requested blocks to the file system.
The file system returns the requested data to the database manager.
The database manager sends a message back to the dispatcher stating that the data is ready to be returned to the requesting user, then sorts and transmits the data to the interface engine over the BYNET.
The BYNET merges the sorted response and returns it to the requesting interface engine for packaging.
The dispatcher builds the response message and routes it to the requesting client system.
The TDP receives and unpacks the response messages and makes them available to the CLI.
CLI passes the received data back to the requesting application in blocks.
The requesting application receives the response data in the form of a relational table.
--> What is the difference between a User and a Database in Teradata?
- A user is a database with a password; a database cannot have a password.
- Both can contain tables, views, and macros.
- Both users and databases may or may not hold privileges.
- Only users can log in, establish a session with the Teradata database, and submit requests.
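A minimal sketch of the two DDL statements (the names, space sizes, and parent database below are illustrative, not from the original):
CREATE DATABASE Sales_DB FROM DBC
AS PERM = 100000000;          /* a database owns space and objects but has no password */
CREATE USER etl_user FROM DBC
AS PERM = 50000000,
PASSWORD = etl_user_pwd;      /* a user additionally has a password and can log on */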
--> How do you create a materialized view in Teradata?
There is no "materialized view" object in Teradata. The equivalent is a join index (or aggregate join index) with a corresponding view put on top of it. The command to create one is CREATE JOIN INDEX ... (with many options).
Join indexes are maintained automatically when the underlying data is maintained. They cannot be accessed directly; the optimizer invokes them when it determines they are beneficial. Even though you cannot query a join index directly, you can create a view that looks like the join index, which is similar to having direct access. However, this approach does not guarantee that the join index will actually be used.
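A hedged sketch of the join-index-plus-view approach (the Customer table, its columns, and the object names are illustrative assumptions; ORDER_Table is the example order table used elsewhere in this document):
/* Multi-table join index that pre-joins customers to their orders */
CREATE JOIN INDEX Cust_Ord_JI AS
SELECT c.customer_number, c.customer_name, o.ORD_number, o.order_total
FROM Customer c
INNER JOIN ORDER_Table o ON c.customer_number = o.customer_number
PRIMARY INDEX (customer_number);
/* View with the same definition; the optimizer may rewrite queries on it to use the join index */
CREATE VIEW Cust_Ord_V AS
SELECT c.customer_number, c.customer_name, o.ORD_number, o.order_total
FROM Customer c
INNER JOIN ORDER_Table o ON c.customer_number = o.customer_number;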
--> What are the differences between Teradata and ANSI session modes in Teradata?
TERADATA MODE
-------------
1. Comparisons are not case sensitive
2. CREATE TABLE defaults to SET tables
3. Each transaction is committed implicitly
4. Supports all Teradata commands
5. Follows BTET (Begin Transaction / End Transaction) mode (see the example after these lists)
ANSI MODE
-------------
1. Comparisons are case sensitive
2. CREATE TABLE defaults to MULTISET tables
3. Each transaction must be committed explicitly
4. Does not support all Teradata commands
5. Does not follow BTET mode
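A minimal sketch of the transaction difference (the table and column names are illustrative); in BTEQ the mode can be chosen before logon with .SET SESSION TRANSACTION BTET; or .SET SESSION TRANSACTION ANSI;
/* Teradata (BTET) mode: each request is an implicit transaction; an explicit transaction is bracketed with BT ... ET */
BT;
UPDATE Emp_Table SET salary = salary * 1.10 WHERE dept_no = 100;
ET;
/* ANSI mode: changes are not permanent until the transaction is ended explicitly */
UPDATE Emp_Table SET salary = salary * 1.10 WHERE dept_no = 100;
COMMIT;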
--> What are the scenarios in which Full Table Scans occur?
1. The WHERE clause in the SELECT statement does not use either a primary index or a secondary index (the EXPLAIN sketch after this list shows how to confirm this).
2. The SQL statement uses a partial value match (LIKE or NOT LIKE) in the WHERE clause.
3. The SQL statement does not contain a WHERE clause.
4. The SQL statement uses a range in the WHERE clause, e.g. (col1 > 40 OR col1 <= 10000).
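A small hedged sketch: running EXPLAIN on a query whose predicate column is not indexed (here, order_total in the ORDER_Table example used elsewhere in this document) exposes the full table scan in the plan text.
EXPLAIN
SELECT *
FROM ORDER_Table
WHERE order_total > 25000;   /* no index on order_total: the plan typically reports an all-AMPs retrieve "by way of an all-rows scan" */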
--> How to identify PPI columns?
SELECT databasename, tablename, columnposition, columnname
FROM dbc.indices
WHERE indextype = 'Q'
ORDER BY 1, 2, 3;
--> How to find out the list of indexes in Teradata?
SELECT
databasename, tablename, columnname, indextype, indexnumber, indexname
FROM
dbc.indices
ORDER BY
databasename, tablename, indexnumber;
IndexType  Description
P          Nonpartitioned primary
Q          Partitioned primary
S          Secondary
J          Join index
N          Hash index
K          Primary key
U          Unique constraint
V          Value-ordered secondary
H          Hash-ordered ALL covering secondary
O          Value-ordered ALL covering secondary
I          Ordering column of a composite secondary index
M          Multi-column statistics
D          Derived column partition statistics
1          Field1 column of a join or hash index
2          Field2 column of a join or hash index
--> Explain the types of data redistribution that happen when joining columns from two tables: Case 1 - PI = PI joins; Case 2 - PI = non-index joins; Case 3 - non-index = non-index joins.
Case 1 - There is no redistribution of data across AMPs. The join is AMP-local, because the matching rows already reside on the same AMP and need not be redistributed. These joins on a unique primary index are very fast.
Case 2 - Data from the second table is redistributed to all AMPs, since the join is on a PI versus a non-index column. The ideal scenario is when the smaller table is redistributed to be joined with the larger table's rows on the same AMP.
Case 3 - Data from both tables is redistributed across all AMPs. This is one of the longest-running types of query; care should be taken to ensure that statistics are collected on the join columns.
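A hedged illustration of the three cases (the Customer table, whose PI is assumed to be customer_number, and the non-index columns are hypothetical); running EXPLAIN on each join shows whether rows are joined AMP-locally or redistributed.
/* Case 1: PI = PI -- matching rows hash to the same AMP, so no redistribution */
EXPLAIN SELECT *
FROM Customer c
JOIN ORDER_Table o ON c.customer_number = o.customer_number;
/* Case 2: PI = non-index column -- EXPLAIN typically shows ORDER_Table rows being "redistributed by the hash code of" the join column */
EXPLAIN SELECT *
FROM Customer c
JOIN ORDER_Table o ON c.customer_number = o.alt_cust_number;
/* Case 3: non-index = non-index -- rows from both tables are redistributed into spool on all AMPs */
EXPLAIN SELECT *
FROM Customer c
JOIN ORDER_Table o ON c.phone_number = o.contact_number;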
--> How do you skip the header row in a FastLoad script?
RECORD 2; /* this skips first record in the source file */
DEFINE ...
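A minimal hedged FastLoad sketch showing where RECORD fits (the logon string, database, field, and file names are illustrative assumptions):
LOGON tdpid/username,password;
BEGIN LOADING mydb.Order_Stage
ERRORFILES mydb.Order_Err1, mydb.Order_Err2;
SET RECORD VARTEXT ",";          /* assumed comma-delimited input */
RECORD 2;                        /* start at record 2, i.e. skip the header row */
DEFINE ORD_number   (VARCHAR(10)),
       order_total  (VARCHAR(12))
FILE = order_data.csv;
INSERT INTO mydb.Order_Stage
VALUES (:ORD_number, :order_total);
END LOADING;
LOGOFF;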
--> What is the difference between Star Schema and FSLDM?
Star Schema
--> Denormalized for query performance
--> A star schema has one fact table and many dimension tables that store the related dimension data.
FSLDM (Financial Services Logical Data Model)
--> 3rd normal form
--> Proposed and designed by NCR to cater to the needs of the financial sector, especially the banking domain.
--> It has many fact tables and dimension tables.
--> The FSLDM can be customized to cater to a bank's specific needs.
Ex. The major fact tables include Party (customers of the bank), Product (services offered by the bank), Event (an event that occurred, e.g. an ATM transaction), Agreement (a deal between the party and the bank for a product), etc.
--> Can you load multiple data files for the same target table using FastLoad?
Answer:
Yes, a table can be loaded from multiple data files in FastLoad.
Before giving the END LOADING statement, define each file path and issue an INSERT for that source file, and give END LOADING only once at the end (see the sketch below).
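A hedged sketch of loading two files into one target table (all object, field, and file names are illustrative assumptions):
LOGON tdpid/username,password;
BEGIN LOADING mydb.Order_Stage
ERRORFILES mydb.Order_Err1, mydb.Order_Err2;
SET RECORD VARTEXT ",";
/* first source file */
DEFINE ORD_number  (VARCHAR(10)),
       order_total (VARCHAR(12))
FILE = orders_jan.csv;
INSERT INTO mydb.Order_Stage VALUES (:ORD_number, :order_total);
/* second source file: another DEFINE/INSERT pair before END LOADING */
DEFINE ORD_number  (VARCHAR(10)),
       order_total (VARCHAR(12))
FILE = orders_feb.csv;
INSERT INTO mydb.Order_Stage VALUES (:ORD_number, :order_total);
END LOADING;        /* the application phase runs once, after all files have been acquired */
LOGOFF;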