Teradata SQL: May 2013

25 May 2013

TPT Utiliti

Teradata Parallel Transporter (TPT) is a flexible, high-performance Data Warehouse loading tool specifically optimized for Teradata Database, which enables data extraction, transformation and loading. It incorporates an infrastructure, which provides a parallel execution environment for product components called “operators”, which integrate with the infrastructure in a "plug-in" fashion and are thus interoperable.

Teradata PT also consists of a separate component, the Teradata PT Application Programmer Interface (abbreviated as TPTAPI), which allows users to invoke some of the Teradata PT operators directly. However, the features presented in this article can only be accomplished through the scripting tool provided by Teradata PT, not the TPTAPI unless users chose to code the features themselves.

Operators provide access to external resources, e.g., files, DBMS tables, Messaging Middleware products, etc., and perform various filtering and transformation functions. The Teradata PT infrastructure incorporates a high performance data transfer mechanism called the Data Stream, which is used for interchanging data between the operators. As shown inFigure 1, a typical Data Stream connects two operators: one of them, known as the “producer”, writes the data to the Data Stream, and another one, known as the “consumer”, reads the data from the Data Stream. Note that operators connected by the Data Stream must share the same schema (i.e., data layout and format) for the data to be interchanged.

Figure 1: Operators and Data Streams

The Teradata PT infrastructure also makes it possible to launch multiple instances of a given operator. Each instance is typically assigned to a system process or thread, and therefore, can execute independently of, and concurrently with, the other instances. To enable workload balancing amongst multiple instances of a consumer operator, the input Data Stream can be split so that each operator instance receives a portion of the overall data. Conversely, output Data Streams produced by multiple instances of a producer operator can be merged to form a single Data Stream, which may also be split at some later point.

Figure 2 shows a typical database loading scenario: multiple instances of the producer operator collect data from multiple sources and write it to their respective output Data Streams. The latter are merged and then split across multiple instances of the consumer operator. It is important to note that in this case all instances of the consumer operator are loading data into a single database target, for example, the same DBMS server.

Figure 2: Multiple operator instances and Data Streams

Figure 2 also shows a set of Teradata PT-provided operators for users to carry out a variety of ETL operations:

• The Load Operator loads data into new tables using a high-performance, multi-session FastLoad protocol.
• The Update Operator loads data and applies updates to both new and existing tables by using the high-performance, multi-session Multiload protocol. The updates can be applied either conditionally or unconditionally based on user-defined rules.
• The Export Operator extracts data from Teradata tables using the high-performance, multi-session FastExport protocol.
• The Stream Operator loads data and applies updates to new and existing tables by using multiple SQL protocol sessions in a high-performance, workload-balancing manner. The updates can be applied either conditionally or unconditionally based on user-defined rules. This operator can be used for loading data from continuous data sources such as queuing systems (e.g., MQ Series), EAI products, etc.
• The SQL Inserter Operator loads data into new and existing tables using a single SQL protocol session.
• The SQL Selector Operator extracts data from existing tables using a single SQL protocol session.
• The ODBC Operator extracts data extracts data from external non-Terdatada ODBC sources. To extract from Teradata Database, use the Export Operator instead.
• The Data Connector Operator allows simultaneous, parallel reading of multiple data sources such as various types of files, queuing systems, etc. It also allows writing to external data sources, such as files.

Teradata PT Capabilities

Teradata PT capabilities can be divided into the following broad categories:

• Scalability and Parallelism
• Usability
• Flexibility

Through these capabilities, Teradata PT addresses many problems frequently encountered when introducing data to the Data Warehouse.

Scalability and Parallelism

Increased Throughput

The use of multiple parallel instances of operators, as well as operator multi-session capability, enable Teradata PT jobs to be scaled-up to maximize the use of available CPU resources on their ETL client hosts. This results in shorter job run times, if either I/O or CPU is the bottleneck from the load tool processing.

Reduced or Eliminated Intermediate File I/O

The use of Data Streams instead of intermediate files, and the application of a single Data Stream to multiple data targets, can greatly reduce the amount of file I/O that would otherwise be required to accomplish many ETL objectives. The use of these features also increases throughput as mentioned above, and reduce or eliminate disk space requirements for intermediate files.

Launching of Multiple Operator Instances

In traditional Teradata utilities, which rely on a single system process to perform data extraction and loading, the single process could reach a threshold beyond which throughput cannot increase. One of the most critical limiting factors in this case is CPU speed and availability.

Teradata PT, due to its parallel and scalable architecture, makes it possible to utilize additional CPU processing power, shorten the load process, and reduce the overall ETL execution time. The number of operator instances is specifiable by the user, affording control over the scalability and performance of the Data Warehouse loading process. In addition, Teradata PT allows data extraction and loading processes to run completely asynchronously to each other. This allows for broader parallelism that improves performance even further.

Usability

SQL-Like Script Language

Teradata PT employs a single script language for specifying ETL operations to be performed by a job. It is based on Teradata Utility Script Languages and ANSI SQL, which emphasizes the goal rather than the method (“what” vs. “how”).

This language allows specification and seamless integration of ETL processes. It also utilizes extensible, reusable metadata objects such as virtual tables and attributes for mapping data sources and targets. This can significantly simplify the creation and administration of ETL processes. Finally, it supports direct control of scalability, which allows users to monitor and adjust performance based on system resources.

Automatic or Manual Restart

Teradata PT Jobs can be directed to take periodic checkpoints, or if periodic checkpoints are not specified, the Teradata PT takes two default checkpoints at the start and the end of data acquisition. If a Teradata PT job is interrupted because of a Teradata Database restart or errors that can be retried (e.g. deadlocks), it automatically restarts from the last checkpoint taken. Jobs interrupted for any other reason can be manually restarted from the last checkpoint taken.

In-Line Filtering and Derived Values

Although Teradata PT can help simplify the data extraction and loading process, in many cases the original data cannot be loaded into the Data Warehouse directly. Instead, it must be pre-processed. The reasons include, but are not limited to, the following:

• The source contains more data than needs to be loaded
• The data needs to be validated
• The data needs to be transformed
• The format of the data is otherwise unsuitable for loading

A special data transformation tool or a user-written program typically performs the required pre-processing. In many cases this may lead to the creation of a separate staging file, rendering the load process more complex and requiring additional I/O.

Teradata PT can perform data modification that can be specified directly in the job script. The specific data modification capabilities are as follows:

• The source data can be filtered via the SQL-compatible WHERE condition. With this facility it is possible to ensure that the extracted data meets certain validation criteria. It is also possible to only select the data that needs to be loaded.
• The source data can be re-arranged using the SQL-compatible SELECT statement. This statement makes it possible to select only the data fields that need to be loaded and ignore the ones that don’t. This can significantly reduce the overall volume of data that is being loaded. Also, the SELECT statement allows specification of “derived values” (i.e., values that are not present in the original data source but instead are determined dynamically) in the process of loading. These derived values can be calculated either conditionally or unconditionally. Please refer to the Teradata Parallel Transporter Reference for more information on performing the above operations.

These modifications are performed in-line (i.e., the data doesn’t need to be landed to an intermediate file), thereby eliminating redundant I/O. The data is sent directly from the data extraction operator to the modification component that forwards it to the loading operator. The entire operation is performed asynchronously.

Flexibility

Seamless Integration of Various Operators within the Same Script

As shown in Figure 2, many of the operators can be seamlessly combined in various ways within the same script to provide solutions to a wide variety of ETL problems. Because operators are modular and reusable, they can be combined and interchanged with only minor adaptations to the script.

The following ETL operations are common:

• Loading data from multiple sources (e.g., multiple files or tape volumes) and into multiple targets (e.g. multiple Teradata databases)
• Combining data from dissimilar sources (e.g., ODBC sources, MQ Series, files with different physical and/or logical formats)
• Loading and updating large amounts of data in a restartable manner
• Transforming the data before loading
• Batch and active data extraction and loading in a single job

Multiple Operational Steps in a Single Job (Job Steps)

A Teradata PT job can contain multiple processing steps. Initialization activities, such as creating target and work tables, are specified at the beginning of the job. Termination activities, such as dropping work tables or handling error tables, are specified at the end of the job. A job can contain as many ETL steps as desired, either to implement complex ETL scenarios or simply to group related ETL processes into a single job.

Export/Extract Feeds Load/Update in a Single Job Step – No Intermediate Files

Data extracted or exported from data sources (by "producer" operators) are fed directly to load and update operators ("consumer" operators) without the use of any intermediate files. This can greatly reduce the amount of file I/O and significantly improve performance due to shorter data path lengths, when compared to the same ETL process accomplished by a combination of the traditional Teradata utilities. When multiple operator instances are specified, producer operators are connected to consumer operators via a multiplexed network of Data Streams that the Teradata PT manages in a load-balancing fashion for maximum efficiency.

Combine Similar and Dissimilar Data Sources in a Single Job Step

While multiple instances of operators can facilitate scalable access to data (e.g. files with the same schema), the Teradata PT’s UNION ALL operation allows the same data source to be accessed in “partitions” using different criteria and specifications for each of the partitions. For example, a relational table can be accessed by the ODBC operator with different “key ranges” so that row sets that satisfy such key ranges can be merged, as long as they have the same schema (i.e. are UNION-compatible).

Teradata PT also allows dissimilar sources to be read in parallel. For example, a Teradata PT job can simultaneously read data from different relational sources or different files from different media (e.g. disks and tapes). This is in contrast with the single-threaded utility, which usually requires additional steps for combining multiple sources into some intermediate file (the so-called “staging file”) before data can be loaded into the target table. Teradata PT makes it possible to avoid creation of staging files and not only simplifies the load process but also makes it faster by eliminating redundant I/O.

Integration of User-Written Modification Components

For those cases where the data modification requirements exceed those that are directly provided by the Teradata PT infrastructure, custom-written modification components can be integrated and invoked from job scripts. These user-written components are called “filter operators” and can contain arbitrary business logic. Please refer to the Teradata PT Operator API for information on how to build filter operators.

Integration of INMODs, OUTMODs, and Access Modules

Some Teradata users have made significant investments in creating customized INMODs, OUTMODs, and Access Modules for use with Teradata utilities. These modules allow users to perform processing that falls outside the capabilities provided by the utilities. Rather than requiring these users to re-implement the business logic contained in such INMODs, OUTMODs, and Access Modules, Teradata PT provides an adapter that makes these modules usable within the Teradata PT framework. This preserves the users’ investments and simplifies the transition to Teradata PT.

Summary

As indicated in previous sections, Teradata PT provides very flexible, high-performance Load and Unload facilities that enable population of the Data Warehouse. Teradata PT improves ETL performance through parallel execution while providing a uniform interface to all its functions. It supports many functions of the traditional Teradata utilities and many new, advanced functions for scalability and parallelization. It also facilitates Active Data Warehousing (ADW) by allowing continuous access to data sources and data targets through the Data Connector and Stream operators, a feature similar to the Teradata TPump utility.

In the upcoming Teradata PT articles, more features will be discussed in more detail and examples of how they are applied in real-world scenarios will be included. All these features not only exploit the power of scalable performance and parallelism for batch-oriented ETL processes, but also lay the foundation for the parallel and scalable real-time transactional updates commonly found in Active Data Warehousing.

24 May 2013

Teradata 13.10 new features (part 3)

Td13.10 not only enhanced existing Multi valued compression but also introduced two new methods

3. Algorithmic compression

Users can now write their own compression and decompression algorithms for characters.

These will have to be implemented in UDF and needs to be tested thoroughly before implementing it in full scale.

There are some samples functions Teradata has provide and can be found in SYSLIB.

TransUnicodeToUTF8, TransUTF8ToUnicode, LZCOMP, LZDECOMP, LZCOMP_L, LZDECOMP_L, CAMSET, DECAMSET, CAMSET_L,DECAMSET_L.

4. Block level compression (applicable to DBA’s, Capacity planners)

BLC is applied at data blocks and allows row data to be stored in datablocks* in a new compressed format.

To enable BLC on tables, new command "COMPRESS" is available in FERRET utility. After BLC is enabled on table, all tows added to table are compressed automatically. Similarly "UNCOMPRESS" command can be used to uncompress packed Data.

BLC is one of most straight forward methods where one doesn’t have to worry about type of data across various columns in table. Row header and internal Teradata structures are also compressed using BLC.

BLC plays major role in saving lot of I/O and Space but at same time it can have some performance implications as more CPU can be utilized to compress and uncompress data dynamically.

Minicylpack in Teradata

MINICYLPACK is File system activity and is handled by Teradata by default.

This activity will exactly free one cylinder in an AMP and it is run automatically when the user request would need free cylinder or huge amount of space.

Cylinders are dedicated for PERM and SPOOL space separately.

MINICYLPACK is kind of last resort in system when the system is completely out of space and it is initiated. The Teradata file system will start to minicylpack when the number of free cylinders drops to the value set by MiniCylPackLowCylProd. The default is 10.

Say in above scenario, an AMP with 6 cylinders, the MiniCylPackLowCylProd is set to 1.

When the number of free cylinders goes below 1 i.e. 0, MINICYLPACKS is initiated and one cylinder is freed up during the process.

MINICYLPACK is resource intense activity because it causes lot of I/O's in process of freeing up the data blocks . It is suggested to run SHOWFSP from Ferret to see number of free cylinders available and cylinders that can be freed up and run PACKDISK to avoid unnecessary MINICYLPACK's

How to start PACKDISK?

please note that We can start /pause/ abort & continue packdisk whenever needed.

(type help to figure more options is you are not sure)

1. To start packdisk, go to ferret-> type PACKDISK

2. you can define range of FSP to start packdisk (say fsp =30 will start for cylinders having less than 30% free space)

3. Above option can be ignored and PACKDISK can be invoked for all cylinders)

SHOWFSP another Ferret utility which is very useful in identifying particular tables that needs to packing.

It also shows amount of cylinders that can freed up by specifying Free space percent (FSP)

By using Showspace from Ferret utility.

this would display free space per cylinder. When the free space is low,it would run minicylpacks which can impact the performance of system . So DBA's need to keep an eye on Free space per cylinder.

How do you determine whether system is running out of space?

1. By using DBC.Diskspace ,

Select

sum(currentperm)/1024/1024/1024 "cur_space_gb" ,

sum(maxperm)/1024/1024/1024 "max_space_gb",

case when ((max_space_gb - cur_space_gb)/max_space_gb) *100 < 25

then 'Warning! Packdisk has to be run'

else 'No Packdisk activity required!' end

from DBC.Diskspace ;

16 May 2013

How do you create materialized view in Teradata?

There is no such things as a matirialized view, the wpuivalant in teradata whould be a join index (or aggrigatr imfrv

13 May 2013

HELP Commands in Teradata Database objects

The HELP Command is used to display information about Teradata database objects such as -

Databases and Users ,Tables ,Views and Macros.

HELP fetches information about above mentioned objects from the Teradata Data Dictionary.

There are various HELP command available in Teradata.

HELP Command

HELP DATABASE databasename;

HELP USER username;

HELP TABLE tablename;

HELP VIEW viewname;

HELP MACRO macroname;

HELP COLUMN table or viewname.*; (to show all columns)

HELP COLUMN table or viewname.colname . . ., colname;

HELP INDEX tablename;

HELP STATISTICS tablename;

HELP CONSTRAINT constraintname;

HELP JOIN INDEX join_indexname;

HELP TRIGGER triggername;

10 May 2013

Using BTEQ Conditional Logic

Below is a BTEQ batch script example. The initial steps of the script will establish the logon, the database, and the delete all the rows from the Employee_Table. If the table does not exist, the BTEQ conditional logic will instruct Teradata to create it. However, if the table already exists, then Teradata will move forward and insert data.

Note: In script examples, the left panel contains BTEQ base commands and the right panel provides a brief description of each command.

.RUN FILE = mylogon.txt	Logon to Teradata
DATABASE SQL_Class;	Make the default database SQL_Class
DELETE FROM Employee_Table;	Deletes all the records from the Employee_Table.
.IF ERRORCODE = 0 THEN .GOTO INSEMPS /* ERRORCODE is a reserved word that contains the outcome status for every SQL statement executed in BTEQ. A zero (0) indicates that statement worked. */	BTEQ conditional logic that will check to ensure that the delete worked or if the table even existed.
$image\img00357.gif$	If the table did not exist, then BTEQ will create it. If the table does exist, the Create table step will be skipped and directly GOTO INSEMPS.
.LABEL INSEMPS INSERT INTO Employee_Table (1232578, ‘Chambers’ ,’Mandee’, 48850.00, 100); INSERT INTO Employee_Table (1256349, ‘Harrison’ ,’Herbert’, 54500.00, 400); .QUIT

BASIC EXPORT SCRIPT

BatchScript.txt File

.LOGON CDW/sql00,whynot	Logon statement onto Teradata. Notice the “,” before the password. This is in batch script format.
SELECT * FROM Employee_Table WHERE Dept_No = 400;	The actual SQL
.LOGOFF	Logging off of Teradata

Submitting SQL in BTEQ’s Interactive Mode

Once you logon to Teradata through BTEQ, you are ready to run your queries. Teradata knows the SQL is finished when it finds a semi-colon, so don’t forget to put one at the end of your query. Below is an example of a Teradata table to demonstrate BTEQ operations.

Employee_Table

Employee_No	Last_Name	First_Name	Salary	Dept_No
2000000 1256349 1333454 1121334	Jones Harrison Smith Strickling	Squiggy Herbert John Cletus	32800.50 54500.00 48000.00 54500.00	? 400 200 400

Figure 2-1

BTEQ execution

.LOGON cdw/sql01;	Type at command prompt: Logon with TDPID and USERNAME.
Password: XXXXX	Then enter PASSWORD at the second prompt.
Enter your BTEQ/SQL Request or BTEQ Command.	BTEQ will respond and is waiting for a command.
SELECT * FROM Employee_Table WHERE Dept_No = 400;	An SQL Statement
* Query Completed. 2 rows found. 5 Columns returned. * Total elapsed time was 1 second.	BTEQ displays information about the answer set.

The result set

Employee_No	Last_name	First_name	Salary	Dept_No
1256349	Harrison	Herbert	54500.00	400
1121334	Strickling	Cletus	54500.00	400

Figure 2-2

Submitting SQL in BTEQ’s Batch Mode

On network-attached systems, BTEQ can also run in batch mode under UNIX (IBM AIX, Hewlett-Packard HP-UX, NCRMP-RAS, Sun Solaris), DOS, Macintosh, Microsoft Windows and OS/2 operating systems. To submit a job in Batch mode do the following:

1. Invoke BTEQ

2. Type in the input file name

3. Type in the location and output file name.

The following example shows how to invoke BTEQ from a DOS command. In order for this to work, the directory called Program Files\NCR\Teradata Client\bin must be established in the search path.

C:/> BTEQ < BatchScript.txt > Output.txt

BTEQ is invoked and takes instructions from a file called BatchScript.txt. The output file is called Output.txt.

Figure 2-3

Notice that the BTEQ command is immediately followed by the ‘<BatchScript.txt’ to tell BTEQ which file contains the commands to execute. Then, the ‘>Output.txt’ names the file where the output messages are written. Here is an example of the contents of BatchScript.txt file.

BatchScript.txt File

.LOGON CDW/sql00,whynot	Logon statement onto Teradata. Notice the “,” before the password. This is in batch script format.
SELECT * FROM Employee_Table WHERE Dept_No = 400;	The actual SQL
.LOGOFF	Logging off of Teradata

Figure 2-4

The above illustration shows how BTEQ can be manually invoked from a command prompt and displays how to specify the name and location of the batch script file to be executed.

The previous examples show that when logging onto BTEQ in interactive mode, the user actually types in a logon string and then Teradata will prompt for a password. However, in batch mode, Teradata requires both a logon and password to be directly stored as part of the script.

Since putting this sensitive information into a script is scary for security reasons, inserting the password directly into a script that is to be processed in batch mode may not be a good idea. It is generally recommended and a common practice to store the logon and password in a separate file that that can be secured. That way, it is not in the script for anyone to see.

For example, the contents of a file called “mylogon.txt” might be:

.LOGON cdw/sql00,whynot.

Then, the script should contain the following command instead of a .LOGON, as shown below and again in the following script: .RUN FILE=mylogon.txt

This command opens and reads the file. It then executes every r