This White Paper on Spool Space in Teradata was presented by Nazir Iqbal at Wipro where he works at present.
Table of Contents
1. Introduction
1.1 Spool Space
1.2 Spool Space and Capacity Planning
1.3 Spool Space Categories
1.4 Spool Space Allocation
2. Causes of spool space error and how to minimize it
3. Know the data
4. Primary Index
5. Multiset or Set table
6. Collect Statistics
7. Skewing
8. Conclusion
9. References
1. INTRODUCTION:
1.1 SPOOL SPACE:
TERADATA Spool Space is unused Perm Space that is used for running queries. Spool Space is used
to hold intermediate rows during processing, and to hold the rows in the answer set of a transaction.
TERADATA recommends that 20% of the available perm space be allocated for Spool Space, but this
varies across applications.
In the majority of cases, well-written SQL queries should not use huge amounts of spool space. A
poor choice of join column, product joins, and a lack of statistics are the main reasons for
excessive spool space consumption. Each user can be given a spool space limit. In later versions of
TERADATA, this is often set in the user's profile.
An insufficient spool error is usually the result of poor table design, poor data distribution, or a
poorly written query. Running out of Spool Space gives the user error code 2646.
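One reason poor data distribution leads to error 2646 is that a user's spool limit is generally described as being divided evenly across the AMPs, so a single overloaded AMP can run out of its share while the others still have room. The sketch below illustrates that idea in Python. It is not Teradata code; the function name `amps_over_share`, the 40 GB limit and the per-AMP figures are assumptions made up for the example.

```python
def amps_over_share(spool_used_per_amp, user_spool_limit):
    """Return the indexes of AMPs whose spool use exceeds an equal share of the limit."""
    share = user_spool_limit / len(spool_used_per_amp)
    return [amp for amp, used in enumerate(spool_used_per_amp) if used > share]

GB = 1024 ** 3
limit = 40 * GB                                # hypothetical user spool limit
balanced = [4 * GB, 5 * GB, 4 * GB, 5 * GB]    # 18 GB spread evenly over four AMPs
skewed = [2 * GB, 3 * GB, 2 * GB, 11 * GB]     # the same 18 GB, but one hot AMP

print(amps_over_share(balanced, limit))  # []
print(amps_over_share(skewed, limit))    # [3]
```

Both workloads use 18 GB in total, well under the 40 GB limit, yet in the skewed case the fourth AMP exceeds its 10 GB share.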
1.2 SPOOL SPACE AND CAPACITY PLANNING:
Spool Space and Capacity Planning are mutually dependent concepts.
Spool space is critical to the operation of Teradata RDBMS, yet it is frequently
overlooked in capacity planning. Size requirements vary from user to user, table to table and
application to application.
For instance,
• The Spool space of a user is used to hold the response rows of every query run by that user during
a session. Thus, each user needs a high enough spool allocation to contain the biggest anticipated
answer set.
• Tables containing large volumes of data require more available spool space than smaller tables,
because intermediate rows are held in spool space during query execution.
1.3. SPOOL SPACE CATEGORIES
Spool falls into three categories of space.
They are:-
Volatile
Intermediate
Output
• Volatile: This spool is retained until the transaction completes (unless the table was created
with ON COMMIT PRESERVE ROWS), the table is dropped manually during the session, the session ends,
or Teradata RDBMS resets.
• Intermediate: Intermediate spool results are retained until no longer needed. We can determine
when intermediate spool is flushed by examining the output of an EXPLAIN.
• Output: Output spool results are either final rows returned in the answer set for a query, or
rows updated within, inserted into, or deleted from a base table.
1.4. SPOOL SPACE ALLOCATION
Teradata RDBMS allocates spool space dynamically only from disk cylinders that are not being used
for permanent or temporary data. Permanent, temporary, and spool data blocks cannot co-exist on
the same cylinder. Spool space is not reserved. All unused space in the Teradata RDBMS is
considered available spool space. When spool is released, the file system returns the cylinders it was
using to the free cylinder list.
We allocate spool space for a database, a user, or a user profile, not at the table level.
A SPOOL limit defined in a profile takes effect upon completion of a:
• CREATE/MODIFY USER statement that assigns the profile to a user.
• MODIFY PROFILE statement that changes the spool space limit.
If the user is logged on, the profile specification affects the current session.
Inefficient SQL queries generally cause capacity planning and spool space allocation to go
awry and raise the Spool Space Error, which is one of the most common errors encountered by a
Teradata SQL programmer.
2. CAUSES OF SPOOL SPACE ERROR AND HOW TO MINIMIZE IT:
When resource thresholds are met, for example when Spool Space is exceeded, the DBAs either issue a
warning or abort the query. Different thresholds are set for tactical, decision support,
and ad-hoc scenarios.
High skew is another cause of Spool Space being exceeded.
Not all alerts or warnings indicate a problem, as some transactions use high CPU and spool
because of large data volumes and the complexity of the code.
We often encounter scenarios where a SQL query has been running for a long time.
The reasons may be:
• Missing or aged statistics.
• Large product joins.
• Merge joins where there is a many-to-many relationship.
• Set tables that should be Multiset.
• Statistics that reflect zero rows on a table that is not empty.
• A change in data volume that requires additional statistics and generates a
different explain plan.
• Unbalanced parentheses.
The key is to know the data before writing SQL code.
3. KNOW THE DATA
Below are a few questions that should always be considered so that the SQL code is efficient and
does not exceed the thresholds of Spool Space or CPU.
1. How many rows exist on the tables in the query?
2. What columns are we joining on?
3. Do we need to add filters or additional joins to reduce volume?
4. How many unique values exist on columns?
5. How many rows exist on tables that are duplicated?
6. Queries having derived tables will often show no confidence because the optimizer does not
know how many rows are in a derived table.
7. High estimated time can indicate aged stats i.e. stats should be collected again.
8. What type of join is performed?
• Product Join: This is a cross join: every row from one table is joined to every row on the
second table. The spool file is as large as (No. of rows table_one * No. of rows table_two), so
large product joins (billions of rows) should be avoided. Product joins are most efficient when a
SMALL lookup table is duplicated. Product joins are inefficient when large fact tables are
duplicated (this can indicate aged or missing stats).
• Merge Join: Requires a sort of the spool files. Merge joins are efficient when there is not a
many-to-many relationship on the columns involved in the join. If there is a many-to-many
relationship, try to aggregate the columns on one table to reduce the volume by creating a
volatile table, derived table or work table.
• Hash Join: The tables do not have to be sorted and the smaller table can be much larger than for
a product join. The smaller table/spool is "hashed" into memory. Then the larger table is scanned
and, for each row, it looks up the row from the smaller table in the hashed table that was created
in memory. If the smaller table is broken into partitions to fit into memory, the larger table
must also be broken into the same partitions prior to the join.
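The product join size rule above (rows of table_one multiplied by rows of table_two) can be checked with simple arithmetic. The following is a minimal Python sketch, not Teradata code; the function name and the row counts are hypothetical and chosen only for illustration.

```python
def product_join_rows(rows_table_one: int, rows_table_two: int) -> int:
    """Rows that an unconstrained product join places in spool."""
    return rows_table_one * rows_table_two

# Hypothetical row counts, for illustration only.
small_lookup = product_join_rows(50, 1_000_000)             # small lookup table x fact table
two_fact_tables = product_join_rows(2_000_000, 1_000_000)   # fact table x fact table

print(f"{small_lookup:,}")      # 50,000,000
print(f"{two_fact_tables:,}")   # 2,000,000,000,000
```

The first case stays manageable because one side is small; the second shows why a product join between two large tables quickly exhausts spool.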
4. PRIMARY INDEX
A poor primary index with lumpy data distribution can cause a query to run for
several hours when it should execute in seconds or minutes. Hence, we should choose a
single column or multiple columns that distribute the data evenly across all AMPs.
Eliminate columns from the primary index that have a lot of null values. The values should
change rarely or never. The column(s) should be frequently used in join constraints.
Teradata is a massively parallel system, so a query runs only as fast as the SLOWEST AMP. If the
table joins to a similar table having the same columns, the primary index on both tables
should be the same.
For example, if AMP-4 has much more data than AMP-1, AMP-2 and AMP-3, the imbalance can cause a
Spool Space Error. The primary index should be chosen to avoid such uneven data distribution
across AMPs.
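To make the effect of the primary index choice concrete, the sketch below distributes the rows of a hypothetical table over four AMPs for two candidate index columns. It is illustrative Python only: CRC32 modulo the AMP count is an assumed stand-in for Teradata's real row hash, and the table, column values and helper names are invented for the example.

```python
import zlib
from collections import Counter

NUM_AMPS = 4  # small on purpose; real systems have many more AMPs

def amp_for(value, num_amps=NUM_AMPS):
    """Stand-in for the row hash: CRC32 of the value, modulo the AMP count."""
    return zlib.crc32(str(value).encode()) % num_amps

def rows_per_amp(pi_values, num_amps=NUM_AMPS):
    """Count how many rows land on each AMP for the given primary index values."""
    counts = Counter(amp_for(v, num_amps) for v in pi_values)
    return [counts.get(amp, 0) for amp in range(num_amps)]

def max_to_avg(per_amp):
    """Ratio of the busiest AMP's row count to the average row count."""
    return max(per_amp) / (sum(per_amp) / len(per_amp))

# Hypothetical 10,000-row table with two candidate primary index columns.
order_ids = range(10_000)                          # unique values
country_codes = ["IN"] * 9_000 + ["US"] * 1_000    # two values, very lumpy

even = rows_per_amp(order_ids)
lumpy = rows_per_amp(country_codes)
print(even, round(max_to_avg(even), 2))
print(lumpy, round(max_to_avg(lumpy), 2))
```

With the two-value column, at least 9,000 of the 10,000 rows hash to a single AMP, so that AMP does most of the work however many AMPs the system has.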
5. MULTISET OR SET TABLE
A set table performs a duplicate row check. If there are a lot of non-unique values for a
primary index, this can be very CPU intensive. For example, for a primary index value shared by
2,000 rows, the duplicate row check is performed about 2,000,000 times (1 + 2 + ... + 1,999 =
1,999,000 row comparisons). This is referred to as chaining. The first record is loaded. The next
record to load with the same PI value is checked against all the columns of the first one to
determine whether it is a duplicate. When the third record is loaded, it is checked against both
the first and the second records, and so on.
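Counting the comparisons exactly as described above, where each new row is checked against every row already loaded with the same PI value, gives 0 + 1 + ... + (n - 1) = n(n - 1)/2 checks for n rows, which grows with the square of n. The short Python sketch below works this out for a few values of n; the function name is hypothetical.

```python
def duplicate_row_checks(rows_per_pi_value: int) -> int:
    """Comparisons made when each new row is checked against all earlier rows
    that share its primary index value: 0 + 1 + ... + (n - 1)."""
    n = rows_per_pi_value
    return n * (n - 1) // 2

for n in (10, 500, 2000):
    print(n, duplicate_row_checks(n))
# 10 45
# 500 124750
# 2000 1999000
```

Going from 10 to 2,000 rows per value multiplies the row count by 200 but the number of checks by more than 40,000.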
A Multiset table allows duplicate rows so the duplicate row check is omitted. If duplicates can
be omitted using a group by or filtered programmatically, a load to a multiset table performs
better.
A Multiset table having a NUPI (non-unique primary index) with between 500 and 2,000 occurrences
per value is acceptable.
For tables having a non-unique primary index where there are several hundred or a couple of
thousand rows for a given primary index value, use a multiset table.
For tables having a more unique index, such as 1 to 10 rows for a given primary index value, use a
set table.
Note: the FASTLOAD utility program will not allow duplicates, even if the target table is
MULTISET.
6. COLLECT STATISTICS
Poor, missing, or aged statistics may cause a Spool Space Error.
TERADATA recommends that COLLECT STATS should include:-
1. Individual columns in an index.
2. All columns in an index, multi-column where size is less than 16 bytes.
3. Join columns.
4. Filter or qualifying columns.
5. Secondary Index
Statistics are not needed for temp tables that are not joined to other tables and only used
for staging.
Be careful NOT to over-collect statistics. If a table is updated by several inserts multiple
times a day, the statistics do not need to be refreshed after each insert. One collection
after the last insert is sufficient. For tables being completely refreshed, the statistics are
needed after the refresh.
TOOL TO CHECK THE EFFICIENCY OF A QUERY:
Run this diagnostic command before the explain of the query.
At the bottom of the explain it will list the statistics that are missing.
Diagnostic helpstats on for session;
Explain
<OUR SQL TEXT>
WE SHOULD NOT COLLECT STATISTICS ON EMPTY TABLES. Doing so will cause the optimizer to
choose an inefficient path based on the information available to the parser. Statistics should
be collected when a table is initially loaded and any time the table's demographics change by
more than 10%. After the initial collection of statistics on an object, the user can run the
statement below to refresh the table's statistics based on the (new) data.
Collect statistics on databasename.tablename; -- this will refresh all stats, index and
column, that were previously gathered on a table
To see the statistics that exist for a table, run the following:
HELP STATISTICS databasename.tablename;
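The 10% guideline mentioned above can be turned into a simple check that a load job runs before deciding whether to recollect. The Python sketch below is one possible reading, not a Teradata feature: it assumes that row count is an adequate proxy for a table's demographics, and the function name and sample figures are hypothetical.

```python
def needs_stats_refresh(rows_at_last_collect: int, rows_now: int,
                        threshold: float = 0.10) -> bool:
    """Return True when the row count has changed by more than the threshold
    since statistics were last collected (row count is an assumed proxy)."""
    if rows_at_last_collect == 0:
        # Statistics taken on an empty table are stale as soon as data arrives.
        return rows_now > 0
    change = abs(rows_now - rows_at_last_collect) / rows_at_last_collect
    return change > threshold

print(needs_stats_refresh(1_000_000, 1_050_000))  # False (5% growth)
print(needs_stats_refresh(1_000_000, 1_200_000))  # True  (20% growth)
print(needs_stats_refresh(0, 500))                # True  (collected while empty)
```

A check like this helps avoid both extremes: stale statistics after a large load, and over-collection after every small insert.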
7. SKEWING:
Proper primary index specification should evenly distribute the rows of a table across the
AMPs. This prevents skewing. The Query Log Information in SQL Assistant and other editors
shows the degree of skewing in a query. A CPU Skew > 50 reflects worst-case scenarios,
and generally any query having a CPU Skew > 4 is considered poorly performing. Hence, by
checking the CPU Skew in the Query Log Information, a programmer can easily identify
which queries need to be fine-tuned to avoid high skew.
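The thresholds above can be applied mechanically once per-AMP CPU figures are available. The Python sketch below assumes that CPU Skew is the busiest AMP's CPU time divided by the average CPU time per AMP; tools differ in how they define the figure, so treat this formula, the function names and the sample numbers as assumptions.

```python
def cpu_skew(cpu_seconds_per_amp):
    """Assumed definition: busiest AMP's CPU time divided by the per-AMP average."""
    if not cpu_seconds_per_amp:
        raise ValueError("need CPU figures for at least one AMP")
    avg = sum(cpu_seconds_per_amp) / len(cpu_seconds_per_amp)
    return max(cpu_seconds_per_amp) / avg

def classify(skew):
    """Apply the thresholds quoted in the text (> 50 worst case, > 4 poor)."""
    if skew > 50:
        return "worst case"
    if skew > 4:
        return "poor"
    return "acceptable"

balanced = [12.0] * 10             # ten AMPs doing equal work
hot_amp = [10.0] * 9 + [100.0]     # one AMP doing most of the work

print(classify(cpu_skew(balanced)))  # acceptable
print(classify(cpu_skew(hot_amp)))   # poor
```

Under this definition the second query has a skew of about 5.3, because one AMP uses 100 CPU seconds against an average of 19.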
8. CONCLUSION:
Most performance-related issues are caused by poor indexing, missing statistics, aged
statistics, over-collecting statistics, mismatched data types, and missing filters and conditions
in the WHERE clause. These can be eliminated if we follow the sound SQL practices
discussed above. The key to efficient SQL coding is good knowledge of the
database and an understanding of the various join constraints and mappings. Database
knowledge, combined with disciplined use of the COLLECT STATISTICS feature of TERADATA, is the
key to avoiding the Spool Space Error.