Parallel Query Tuning


The parallel query feature is capable of dramatic response time reductions for data-intensive operations on very large decision support databases. The first half of this chapter outlines three basic parallel query tuning steps to get you up and running. The second half provides detailed information to help you diagnose and solve tuning problems.

See Also: Appendix C, "Parallel Query Concepts", to understand the basic principles of parallel query processing.

See your operating system-specific Oracle documentation for more information about tuning while using the parallel query feature.

Introduction to Parallel Query Tuning


The parallel query feature is useful for queries that access a large amount of data by way of large table scans, large joins, the creation of large indexes, bulk loads, aggregation, or copying. It benefits systems with all of the following characteristics: a multiprocessor (SMP or MPP) machine, sufficient I/O bandwidth, CPUs that are under-utilized at least part of the time, and sufficient memory to support additional memory-intensive operations such as sorts, hashing, and I/O buffers.

If any one of these conditions is not true for your system, the parallel query feature may not significantly help performance. In fact, on over-utilized systems or systems with small I/O bandwidth, the parallel query feature can impede system performance.

The three basic steps for tuning the parallel query are outlined in the following sections.

Step 1: Tuning System Parameters for the Parallel Query


Many initialization parameters affect parallel query performance. For best results, start with an initialization file that is appropriate for the intended application.

Before starting the Oracle Server, set the following initialization parameters. The recommended settings are guidelines for a large data warehouse (more than 100 gigabytes) on a typical high-end shared memory multiprocessor with one or two gigabytes of memory. Each section explains how to modify these settings for other configurations. Note that you can change some of these parameters dynamically with ALTER SYSTEM or ALTER SESSION statements. The parameters are grouped as follows:


Parameters Affecting Resource Consumption

You must configure resources at two levels: within Oracle, using the initialization parameters described in this section, and at the operating system level.

On some platforms you may need to set operating system parameters that control the total amount of virtual memory available, summed across all processes.

As a general guideline for memory sizing, note that each process needs address space big enough for its hash joins. A dominant factor in heavyweight decision support (DSS) queries is the relationship between memory, the number of processes, and the number of hash join operations. Because hash joins and large sorts are memory-hungry operations, you may want to configure fewer processes, each with a greater limit on the amount of memory it can use.

Memory available for DSS queries comes from process memory, which in turn comes from virtual memory. Total virtual memory should be somewhat more than available real memory, which is the physical memory minus the size of the SGA.

The SGA is static, of fixed size. Typically it comes out of real physical memory. If you want to change the size of the SGA you must shut down the database, make the change, and restart the database. DSS memory is much more dynamic: it comes out of process memory, and both the size of a process's memory and the number of processes can vary greatly.

Virtual memory is typically more than physical memory, but should not generally exceed twice the size of the physical memory less the SGA size. If you make it many times more than real memory, the paging rate may go up when the machine is overloaded at peak times.

Tune the following parameters to ensure that resource consumption is optimized for your parallel query needs.

Attention: "Understanding Parallel Query Performance Issues" on page 18-26 describes in detail how these parameters interrelate, and provides a formula to help you balance their values.


HASH_AREA_SIZE

Recommended Value: at least 8MB for a large data warehouse. May range from 8MB to 32MB or more.

This parameter provides for adequate memory for hash joins. Each process that performs a parallel hash join uses an amount of memory equal to HASH_AREA_SIZE.

Hash join performance is more sensitive to HASH_AREA_SIZE than sort performance is to SORT_AREA_SIZE. As with SORT_AREA_SIZE, too large a hash area may cause the system to run out of memory.

Hash join I/O bypasses the buffer cache; even low values of HASH_AREA_SIZE will not cause hash blocks to be cached there. Too small a setting could, however, affect performance.
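For instance, a minimal sketch of the two ways this parameter is commonly set (the 32MB figure is illustrative, not a universal recommendation; HASH_AREA_SIZE is among the parameters that can typically be changed per session with ALTER SESSION, as noted earlier in this chapter):

HASH_AREA_SIZE = 33554432   # init.ora: 32MB per process for hash joins

ALTER SESSION SET HASH_AREA_SIZE = 33554432;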

See Also: "SORT_AREA_SIZE" on page 18-9


OPTIMIZER_PERCENT_PARALLEL

Recommended Value: 100/number_of_concurrent_users
This parameter determines how aggressively the optimizer attempts to parallelize a given execution plan. OPTIMIZER_PERCENT_PARALLEL encourages the optimizer to choose plans with low response time because of parallel execution, even if total resource consumption is not minimized.

The default value of OPTIMIZER_PERCENT_PARALLEL is 0, which causes the optimizer to choose the plan that uses the least resources and then parallelize it if possible. (Plans that use index access rather than table scans are not parallelized.) With this setting the execution time of the query may be long, because only a small amount of resources is used. A value of 100 causes the optimizer to choose a parallel plan unless a serial plan would complete faster.

Note: Given an appropriate index, a single record can be selected from a table very quickly and without parallelism. A full scan to find that single row could be executed in parallel, but each parallel process would normally examine many rows; response time would then be higher, and total system resource use much greater, than with a serial plan using the index. Compared with a serial full scan, however, a parallel plan shortens the delay because more resources are applied: it could use up to D times more resources, where D is the degree of parallelism. A value between 0 and 100 sets an intermediate trade-off between throughput and response time: low values favor indexes; high values favor table scans.

A FIRST_ROWS hint or optimizer mode will override a non-zero setting of OPTIMIZER_PERCENT_PARALLEL.
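As a concrete sketch, the recommendation above works out as follows for four concurrent users (the figure of four is illustrative):

OPTIMIZER_PERCENT_PARALLEL = 25   # 100 / 4 concurrent users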


PARALLEL_MAX_SERVERS

Recommended Value: 2 * CPUs * number_of_concurrent_users
Most queries need at most twice the number of query server processes as the maximum degree of parallelism attributed to any table in the query. By default this is at most twice the number of CPUs. The following figure illustrates how the recommended value is derived.

Figure: PARALLEL_MAX_SERVERS = 2 * CPUs * Users
To support concurrent users, add more query servers. Note that if a database's users start up too many concurrent queries, Oracle may run out of query servers. Should this happen, Oracle will execute the query sequentially, or give an error if PARALLEL_MIN_PERCENT is set.
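For example, a minimal init.ora sketch for a 10-CPU machine supporting 4 concurrent users (both figures illustrative):

PARALLEL_MAX_SERVERS = 80   # 2 * 10 CPUs * 4 concurrent users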


PARALLEL_MIN_SERVERS

Recommended Value: PARALLEL_MAX_SERVERS
The system parameter PARALLEL_MIN_SERVERS allows you to specify the number of processes to be started and reserved for parallel query operations at startup in a single instance. The syntax is:

PARALLEL_MIN_SERVERS=n

where n is the number of processes you want to start and reserve for parallel query operations. Sometimes you may not be able to increase the maximum number of query servers for an instance because the maximum number depends on the capacity of the CPUs and the I/O bandwidth (platform-specific issues). However, if servers are continuously starting and shutting down, you should consider increasing the value of the parameter PARALLEL_MIN_SERVERS.

For example, if you have determined that the maximum number of concurrent query servers that your machine can manage is 100, you should set PARALLEL_MAX_SERVERS to 100. Next determine how many query servers the average query needs, and how many queries are likely to be executed concurrently. For this example, assume you will have two concurrent queries with 20 as the average degree of parallelism. Since each query requires twice as many servers as its degree of parallelism, at any given time there could be 2 * 20 * 2 = 80 query servers busy on the instance. You should therefore set the parameter PARALLEL_MIN_SERVERS to 80.

Consider decreasing PARALLEL_MIN_SERVERS if fewer query servers than this value are typically busy at any given time. Idle query servers constitute unnecessary system overhead.

Consider increasing PARALLEL_MIN_SERVERS if more query servers than this value are typically active, and the "Servers Started" statistic of V$PQ_SYSSTAT is continuously growing.
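To check these statistics, you might query V$PQ_SYSSTAT directly; a sketch (the view exposes STATISTIC/VALUE pairs, and statistic names can vary slightly by release):

SELECT statistic, value
FROM v$pq_sysstat
WHERE statistic IN ('Servers Started', 'Servers Busy', 'Servers Idle');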


SHARED_POOL_SIZE

Recommended Value: default plus
(3 * msgbuffer_size) * (CPUs + 2) * PARALLEL_MAX_SERVERS

Increase the initial value of this parameter to provide space for a pool of message buffers that parallel query servers can use to communicate with each other. As illustrated in the following figure, assuming 4 concurrent users and 2K buffer size, you would increase SHARED_POOL_SIZE by 6K * (CPUs + 2) * PARALLEL_MAX_SERVERS for a pool of message buffers that parallel query servers use to communicate. This value grows quadratically with the degree of parallelism.

Figure: Increasing SHARED_POOL_SIZE with Degree of Parallelism
Note: The message buffer size might be 2K or 4K, depending on the platform. Check your platform vendor's documentation for details.
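To make the arithmetic concrete, assume 10 CPUs, 4 concurrent users, and a 2K message buffer (illustrative figures), so that PARALLEL_MAX_SERVERS = 2 * 10 * 4 = 80. The increase to SHARED_POOL_SIZE is then:

(3 * 2K) * (10 + 2) * 80 = 6K * 12 * 80 = 5760K (about 5.6MB)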

On Oracle Parallel Server, there can be multiple CPUs in a single node, and parallel query can be performed across nodes. Whereas SMP systems use 3 buffers per connection, Oracle Parallel Server uses 4 buffers to connect between instances: 2 in the local shared pool and 2 in the remote shared pool. The formula for increasing the value of SHARED_POOL_SIZE on OPS therefore becomes:

(4 * msgbuffer_size) * ((CPUs_per_node * #nodes ) + 2) * (PARALLEL_MAX_SERVERS * #nodes)

Note that the degree of parallelism on OPS is expressed by the number of CPUs per node multiplied by the number of nodes.


SORT_AREA_SIZE

Sample Range: 256K to 4M
This parameter specifies the amount of memory to allocate per query server for sort operations. If memory is abundant on your system, you can benefit from setting SORT_AREA_SIZE to a large value. This can dramatically increase the performance of sort and hash operations since the entire operation is more likely to be performed in memory. However, if memory is a concern for your system, you may want to limit the amount of memory allocated for sorts and hashing operations and increase the size of the buffer cache so that data blocks from temporary sort segments can be cached in the buffer cache.

If the sort area is too small, an excessive amount of I/O will be required to merge a large number of runs. If the sort area size is smaller than the amount of data to sort, then the sort will spill to disk, creating sort runs. These must then be merged again using the sort area. If the sort area size is very small, there will be many runs to merge, and it may take multiple passes to merge them together. The amount of I/O increases as the sort area size decreases.

If the sort area is too large, the operating system paging rate will be excessive. The cumulative sort area adds up fast because each parallel query server can allocate this amount of memory for each sort. Monitor the operating system paging rate to see whether too much memory is being requested.

See Also: "HASH_AREA_SIZE" on page 18-4


Parameters Enabling New Features

Set the following parameters in order to use the latest available functionality.


ALWAYS_ANTI_JOIN

Recommended Value: HASH
When set to HASH, this parameter causes the NOT IN operator to be evaluated in parallel using a parallel hash anti-join. Without this parameter set to HASH, NOT IN is evaluated as a (sequential) correlated subquery.

Figure: Parallel Hash Anti-join
As illustrated above, the SQL IN predicate can be evaluated using a join to intersect two sets. Thus emp.deptno can be joined to dept.deptno to yield a list of employees in a set of departments.

Alternatively, the SQL NOT IN predicate can be evaluated using an anti-join to subtract two sets: emp.deptno can be anti-joined to dept.deptno to select all employees who are not in a given set of departments, for example all employees who are not in the Shipping or Receiving departments.
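For example, with ALWAYS_ANTI_JOIN = HASH in the initialization file, a query of the following form can be evaluated as a parallel hash anti-join (a sketch using the familiar emp and dept tables; the department names are illustrative, and the anti-join transformation requires that deptno reject nulls, either through a NOT NULL constraint or an IS NOT NULL predicate):

SELECT ename
FROM emp
WHERE deptno NOT IN
  (SELECT deptno FROM dept
   WHERE dname IN ('SHIPPING', 'RECEIVING'));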


COMPATIBLE

Sample Value: 7.3.3
This parameter enables new features that may prevent you from falling back to an earlier release. To be sure that you are getting the full benefit of the latest performance features, set this parameter equal to the current release. The default value is release 7.0.12, which lacks many important new features.

For example, direct read for table scans and sorts was introduced in release 7.1.5. This feature greatly speeds up table scans on SMP platforms; if COMPATIBLE is set below 7.1.5, all reads go through the buffer cache. Similarly, temporary tablespaces, introduced in release 7.3, improve the efficiency of sorts and hash joins.

Note: Make a full backup before you change the value of this parameter.


HASH_JOIN_ENABLED

Recommended Value: TRUE
This parameter enables hash joins, which can be much faster than sort merge joins.

See Also: Oracle Server Concepts


PARTITION_VIEW_ENABLED

Recommended Value: TRUE
This parameter enables optimization of UNION ALL views to provide key-range partitioning and partition skipping. Specifically, CHECK constraints and view predicates are combined with predicates in user queries to skip partitions that are not needed to answer the query. Statistics and index information from the partition tables are combined and used by the optimizer just as if the view were a real table.

Figure: Partition Views
See Also: Chapter 11, "Managing Partition Views"


Parameters Related to I/O

Tune the following parameters to ensure that I/O operations are optimized for the parallel query.


DB_BLOCK_SIZE

Recommended Value: 8K or 16K
The database block size must be set when the database is created. If you are creating a new database, use a large block size.


DB_FILE_MULTIBLOCK_READ_COUNT

Recommended Value: 8, for 8K block size; 4, for 16K block size
This parameter determines how many database blocks are read with a single operating system READ call. Many platforms limit the number of bytes read to 64K, limiting the effective maximum for an 8K block size to 8. Other platforms have a higher limit. For most applications, 64K is acceptable.

In general, use the formula:
DB_FILE_MULTIBLOCK_READ_COUNT = 64K/DB_BLOCK_SIZE.


HASH_MULTIBLOCK_IO_COUNT

Recommended Value: 4
Increasing the value of HASH_MULTIBLOCK_IO_COUNT decreases the number of hash buckets. If a system is I/O bound, you can increase the efficiency of I/O by having larger transfers per I/O.

Because memory for I/O buffers comes from the HASH_AREA_SIZE, larger I/O buffers mean fewer hash buckets. There is a trade-off, however. For large tables (hundreds of gigabytes in size) it is better to have more hash buckets and slightly less efficient I/Os. If you find an I/O bound condition on temporary space during hash join, consider increasing this value.


SORT_DIRECT_WRITES

Recommended Value: AUTO
This parameter causes the buffer cache to be bypassed for the writing of sort runs. Going through the buffer cache may result in greater path length, excessive memory bus utilization, and LRU latch contention on SMPs. Avoiding the buffer cache can thus improve performance by a factor of 3 or more. It also removes the need to tune buffer cache and DBWR parameters.

Serial and parallel queries that involve serial table scans (but not parallel table scans), and all index lookups, use the buffer cache for both the index and the table the index references. Sorts use the buffer cache if SORT_DIRECT_WRITES is FALSE, or if it is set to AUTO but SORT_AREA_SIZE is small. INSERT, UPDATE, and DELETE statements also use the buffer cache. By contrast, CREATE INDEX and CREATE TABLE AS SELECT statements do not.

If paging is high, it is a symptom that the relationship of memory, users, and query servers is out of balance. To rebalance it, you can reduce the sort or hash area size. You can limit the amount of memory used for sorts by setting SORT_DIRECT_WRITES to AUTO while keeping SORT_AREA_SIZE small; sort blocks will then be cached in the buffer cache. Note that SORT_DIRECT_WRITES has no effect on hashing.

See Also: "HASH_AREA_SIZE" on page 18-4

"The Formula for Memory, Users, and Query Servers" on page 18-26


SORT_READ_FAC

Recommended Value: Same as DB_FILE_MULTIBLOCK_READ_COUNT
This parameter sets the number of blocks read with a single operating system read call, from a temporary tablespace during a sort.


USE_ASYNC_IO or ASYNC_WRITE

Recommended Value: TRUE
This parameter allows parallel query server processes to overlap I/O requests with processing when performing table scans. Asynchronous read must be enabled for this to occur.

Figure: Asynchronous Read
Asynchronous operations are currently supported only for parallel table scans and hash joins; they are not supported for sorts or for serial table scans. In addition, this feature may require operating system-specific configuration and may not be supported on all platforms. Check your Oracle platform-specific documentation.

Step 2: Tuning Physical Database Layout for the Parallel Query


This section describes how to tune the physical database layout for optimal performance of parallel query. To this end, it presents a case study showing how to prepare a simple database for parallel query.


A Case Study

The case study in this chapter illustrates how to create, load, index, and analyze a large fact table, partitioned using partition views, in a typical star schema. The example 120GB table is named "facts". This case study assumes a 10-CPU shared memory computer with more than 100 disk drives. Thirty 4GB disks will be used for base table data, 10 disks for the index, and 30 disks for temporary space. Additional disks are needed for rollback segments, control files, log files, a possible staging area for loader flat files, and so on. The facts table will be partitioned by month into 12 logical partitions. Each partition will be spread evenly over 10 disks, so that a scan that accesses only a few partitions, or a single partition, can still proceed with full parallelism.


Striping Data

To avoid I/O bottlenecks, stripe all tablespaces accessed in parallel over at least as many disks as the degree of parallelism, and over at least as many devices as CPUs. This includes tablespaces for tables, tablespaces for indexes, and temporary tablespaces. You must also spread the devices over controllers, I/O channels, and/or internal buses.

Figure: Striping Objects Over at Least as Many Devices as CPUs
To stripe data during load, use the FILE= clause of parallel loader to load data from multiple load sessions into different files in the tablespace. For any striping to be effective, you must ensure that there are enough controllers and other I/O components to support the bandwidth of parallel data movement into and out of the striped tablespaces.

The operating system or volume manager can perform striping (OS striping), or you can manually perform striping for parallel operations.


Operating System Striping

Operating system striping is usually flexible and easy to manage. It supports multiple users running sequentially as well as single users running in parallel.

Good stripe sizes for table data are 128K, 256K, 1MB, and 5MB, depending on the block size and the size of the data object. For example, with a medium-sized database (perhaps 20GB), 1MB is a good stripe size. On a very large database (over 100GB), 5MB tends to be the best stripe size. These recommended sizes represent a compromise among the requirements of query performance, backup and restore performance, and load balancing. Setting the stripe size too small detracts from performance, particularly for backup and restore operations.

See Also: For MPP systems, see your platform-specific Oracle documentation regarding the advisability of disabling disk affinity when using operating system striping.


Manual Striping

Manual striping can be done on all platforms. It requires more DBA planning and effort to set up, and may yield better performance if only a single query is running; this advantage may disappear, however, when many queries run concurrently.

For manual striping add multiple files, each on a separate disk, to each tablespace.


Striping and Media Recovery

Striping affects media recovery. Loss of a disk usually means loss of access to all objects that were stored on that disk. If all objects are striped over all disks, then loss of any disk takes down the entire database. Furthermore, all database files may have to be restored from backups, even if each file has only a small fraction actually stored on the failed disk.

Often, the same OS subsystem that provides striping also provides mirroring. With the declining price of disks, mirroring can provide an effective alternative solution to backups and log archival. Disaster recovery is still an issue, however. Even cheaper than mirroring is RAID technology, which avoids full duplication in favor of more expensive write operations. For read-mostly applications, this may suffice.

Note: RAID5 technology is particularly slow on write operations. This may affect your database restore time to a point that RAID5 performance is unacceptable.

See Also: For a discussion of manually striping tables across datafiles, refer to "Striping Disks" on page 14-23.

For a discussion of media recovery issues, see "Backup and Recovery of the Data Warehouse" on page 6-8.

For more information about automatic file striping and tools to use to determine I/O distribution among your devices, refer to your operating system documentation.


Partitioning Data

With the partition view feature you can use the UNION ALL construct to partition a large table into several smaller tables and make the partitioning transparent to queries using a view. This feature supports:

When combined with existing features and a few tips and techniques, partition views provide a flexible and powerful partitioning capability. For example, a new partition can be added, an existing partition can be reorganized, or an old partition can be dropped with less than a second of interruption to a read-only application.


When to Use Partition Views in a Data Warehouse

Consider using partition views in a data warehouse in any of the following situations:


Partition View Trade-offs

The case study in this chapter illustrates how to use a partition view for maximum query performance, not necessarily for minimum downtime in the event of disk failure.

Here, each partition is spread across one third of the disks in the tablespace so that loss of a single disk causes 4 out of 12 partitions to become unavailable.

Alternatively, partitions may be assigned to disks such that a disk failure takes out a single partition, and surviving partitions remain available. The trade-off is that queries against a single partition may not scale due to the limited I/O bandwidth of two or three disks.

For best performance with Oracle Parallel Server, the physical layout of individual partitions should be optimized such that each is scanned in parallel, and each has the best possible I/O throughput to query servers. To avoid I/O bottlenecks when not all partitions are being scanned (because some have been eliminated), each partition should be spread over a number of devices. Furthermore, on parallel server systems, those devices should be spread over multiple nodes.

See Also: Chapter 11, "Managing Partition Views"

Your operating system specific Oracle documentation.


Determining the Degree of Parallelism

Oracle automatically computes the default parallel degree of a table as the minimum of the number of disks storing the table and the number of CPUs available. If, as recommended, you have spread objects over at least as many disks as you have CPUs, the default parallelism will always be the number of CPUs. Warehouse operations are typically CPU bound; thus the default is a good choice, especially if you are using the new asynchronous readahead feature. Consider explicitly setting the parallel degree to 2 * CPUs if you are performing synchronous reads. Consider reducing parallelism for objects that are frequently accessed by two or more concurrent parallel queries.

Should you find that some operations are I/O bound with the default parallelism, and you have more disks than CPUs, override the usual parallelism with a hint that increases parallelism up to the number of disks, or until the CPUs become saturated.
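For example, a hint of the following form overrides the default degree for a single query (the table name and degree are illustrative):

SELECT /*+ FULL(f) PARALLEL(f, 20) */ COUNT(*)
FROM fact_1 f;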

Common I/O bound operations are:

Note that the degree of parallelism for a partition view is conservatively set by default to be the maximum of the degrees of all partitions (not the sum).


Using Parallel Load

This section describes how to load a large table, such as the FACT table in a decision support database. This example uses SQL*Loader to explicitly stripe data over 30 disks in a tablespace. The disks in this example have OS file names /dev/D1, /dev/D2, ..., /dev/D30.

  1. Create the tablespace.

    Below is the command to create a tablespace named "TSfacts". We specify a single datafile with the create command. Later we will specify 29 more datafiles to add to TSfacts. Oracle initializes each block in the datafile, so we want to add the datafiles in parallel to speed up datafile initialization.

    CREATE TABLESPACE TSfacts
    DATAFILE '/dev/D1' SIZE 4096MB REUSE
    DEFAULT STORAGE (INITIAL 64K NEXT 100MB PCTINCREASE 0);

    Extent sizes should be multiples of the multiblock read size, where

    DB_BLOCK_SIZE * DB_FILE_MULTIBLOCK_READ_COUNT = multiblock read size

    Note in particular the following aspects of our approach:

    In release 7.3, objects can have an unlimited number of extents, provided you have set the COMPATIBLE initialization parameter to 7.3 or higher and specified MAXEXTENTS UNLIMITED on the CREATE or ALTER command for the tablespace or object. In practice, however, a limit of 10,000 extents per object is reasonable.

    Each loader process can typically load between 1 and 2 gigabytes per hour. Since we have 10 CPUs, we will use parallel degree 10.

    Note: It is not desirable to allocate extents faster than about 2 or 3 per minute. See the "ST Enqueue" on page 18-34 for more information.

    Thus, each process should get an extent that will last for 3-5 minutes; that is at least 50MB. The largest possible extent size is 1GB, because each 4GB disk holds pieces of 4 partitions. 100MB extents should work nicely, giving each partition 100 extents. The default storage parameters can be customized for each object created in the tablespace, if needed. The arithmetic is summarized below.
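    To make the extent arithmetic concrete, the case study's figures work out as follows (load rates are the rough figures quoted above):

        120GB of data / 12 partitions = 10GB per partition
        10GB per partition / 100MB extents = 100 extents per partition
        100MB per extent / 1-2GB per hour per process = one extent every 3-6 minutes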

  2. Add datafiles in parallel.

    To speed datafile initialization, add the datafiles in parallel. Use a scripting language such as csh or perl to run multiple instances of Server Manager or SQL*Plus in line mode.

    Create as many datafiles as the degree of parallelism you will use for creating and loading objects in the tablespace. This reduces fragmentation and consequent waste of space in the tablespace. Create multiple datafiles even if you are using OS stripes.

    We run the following commands concurrently as background processes:

    		ALTER TABLESPACE TSfacts ADD DATAFILE '/dev/D2' SIZE 4096MB REUSE
    ALTER TABLESPACE TSfacts ADD DATAFILE '/dev/D3' SIZE 4096MB REUSE
    ...
    ALTER TABLESPACE TSfacts ADD DATAFILE '/dev/D30' SIZE 4096MB REUSE

  3. Create the partition view.

    We create a table in the TSfacts tablespace for each partition of the partition view. It contains multiple dimensions and multiple measures. The partition view itself is a view named "facts". The partitioning column is named "dim_2" and is a date. There are other columns as well.

    CREATE TABLE fact_1 (dim_1 NUMBER, dim_2 DATE, ...
        meas_1 NUMBER, meas_2 NUMBER, ...) TABLESPACE TSfacts PARALLEL;
    CREATE TABLE fact_2 (dim_1 NUMBER, dim_2 DATE, ...
        meas_1 NUMBER, meas_2 NUMBER, ...) TABLESPACE TSfacts PARALLEL;
    . . .
    CREATE TABLE fact_12 (dim_1 NUMBER, dim_2 DATE, ...
        meas_1 NUMBER, meas_2 NUMBER, ...) TABLESPACE TSfacts PARALLEL;

    CREATE OR REPLACE VIEW facts AS
        SELECT * FROM fact_1 UNION ALL
        SELECT * FROM fact_2 UNION ALL
        . . .
        SELECT * FROM fact_12;

  4. Load the table in parallel.

    We will use 10 processes loading into 30 disks. To accomplish this, the DBA must split the input into 120 files beforehand. The 10 processes will load the first partition in parallel on the first 10 disks, then the second partition in parallel on the second 10 disks, and so on through the twelfth partition. We run the following commands concurrently as background processes:

    SQLLDR DATA=fact_1.file_1 DIRECT=TRUE PARALLEL=TRUE FILE=/dev/D1
    ...
    SQLLDR DATA=fact_1.file_10 DIRECT=TRUE PARALLEL=TRUE FILE=/dev/D10
    WAIT;
    ...
    SQLLDR DATA=fact_12.file_1 DIRECT=TRUE PARALLEL=TRUE FILE=/dev/D21
    ...
    SQLLDR DATA=fact_12.file_10 DIRECT=TRUE PARALLEL=TRUE FILE=/dev/D30

    For Oracle Parallel Server, divide the loader sessions evenly among the nodes. The data file being read should always reside on the same node as the loader session; NFS-mounting the data file from a remote node is not an optimal approach.

    Note: Although this example shows parallel load used with partition views, the two features can be used independently of one another.

  5. Define partitioning criteria.

    The final step in creating a partition view is to add check constraints to the partitions so that the optimizer can skip partitions that are not needed to answer specific queries. For example, if a query scans rows in which dim_2 is between January 1 and February 13, you only need to scan fact_1 and fact_2. The constraints also prevent erroneous inserts and updates.

    We run the following commands concurrently in the background.

    ALTER TABLE fact_1 ADD CONSTRAINT month_1
    CHECK (dim_2 BETWEEN '01-01-1995' AND '01-31-1995')
    ...
    ALTER TABLE fact_12 ADD CONSTRAINT month_12
    CHECK (dim_2 BETWEEN '12-01-1995' AND '12-31-1995')


Setting Up Temporary Tablespaces for Parallel Sort and Hash Join

For optimal space management performance you can use the dedicated temporary tablespaces available in release 7.3. As with the TSfacts tablespace, we first add a single datafile and later add 29 more in parallel.

CREATE TABLESPACE TStemp TEMPORARY DATAFILE '/dev/D31' 
SIZE 4096MB REUSE
DEFAULT STORAGE (INITIAL 10MB NEXT 10MB PCTINCREASE 0);


Size of Temporary Extents

Temporary extents should normally be all the same size (to avoid fragmentation), and smaller than permanent extents. As a general rule, temporary extents should be in the range of 1MB to 10MB.

Temporary extents should be smaller than permanent extents because there are more demands for temporary space, and parallel processes or other queries running concurrently must share the temporary tablespace. Once you allocate an extent it is yours for the duration of your operation. If you allocate a large extent but only need to use a small amount of space, the unused space in the extent is tied up.

At the same time, temporary extents should be large enough that processes do not have to spend all their time waiting for space. Although temporary tablespaces use less overhead than permanent tablespaces when allocating and freeing a new extent, obtaining a new temporary extent is not completely free of overhead.


Operating System Striping of Temporary Tablespaces

Operating system striping is an alternative technique you can use with temporary tablespaces. Media recovery, however, poses subtle challenges for large temporary tablespaces. It does not make sense to mirror, use RAID, or back up a temporary tablespace: if you lose a disk in an OS-striped temporary space, you will probably have to drop and recreate the tablespace, which could take several hours for our 120GB example. With Oracle striping, you can simply remove the bad disk from the tablespace. For example, if /dev/D50 fails, enter:

ALTER DATABASE DATAFILE '/dev/D50' RESIZE 1K;
ALTER DATABASE DATAFILE '/dev/D50' OFFLINE;

Because the dictionary sees the size as 1K, which is less than the extent size, the bad file will never be accessed. Eventually, you may wish to recreate the tablespace because Oracle allows a maximum of 1022 files in the database.

Be sure to make your temporary tablespace available for use:

ALTER USER scott TEMPORARY TABLESPACE TStemp; 

See Also: For MPP systems, see your platform-specific documentation regarding the advisability of disabling disk affinity when using operating system striping.


Creating Indexes in Parallel

After loading data and setting up for sorts, the next step is to create indexes. The considerations for creating the index tablespace are similar to those for other tablespaces. Because indexes are accessed much more randomly than tables and temporary space, OS striping with a small stripe width is often the best choice. Because the UNRECOVERABLE option achieves its performance advantage by skipping redo generation, plan on mirroring the index tablespace, or on recreating the indexes, in the event of media failure.

Create at least as many files in the tablespace as the degree of parallelism used to create indexes in the tablespace. This will reduce fragmentation.

Because we used partition views, we must create indexes on each partition. Our options are summarized as follows:

Serial, each index one at a time: Slow.

Serial, all indexes concurrently: Use a scripting language. I/O interference is possible: processes could interfere with each other, repeatedly scanning the table.

Parallel, each index one at a time: Recommended. Run each statement, one at a time, parallelized. Spreads data over files.

Parallel, all indexes concurrently: May overload system capacity.

If OS striping is used, we can choose to create indexes one at a time using parallel index creation for each one. Creating all indexes concurrently in parallel would probably overload the capacity of the machine. If Oracle striping is used, we should use parallel index creation for each index so that each index is spread over many disks for high I/O bandwidth (unless partial availability after media failure is the primary goal). To do this we enter:

CREATE TABLESPACE TSidx DATAFILE '/dev/D51' SIZE 4096M
DEFAULT STORAGE (INITIAL 40MB NEXT 40MB PCTINCREASE 0);
CREATE INDEX I1 ON fact_1(dim_1, dim_2, dim_3)
TABLESPACE TSidx PARALLEL UNRECOVERABLE;
CREATE INDEX I2 ON fact_2(dim_1, dim_2, dim_3)
TABLESPACE TSidx PARALLEL UNRECOVERABLE;
...
CREATE INDEX I12 ON fact_12(dim_1, dim_2, dim_3)
TABLESPACE TSidx PARALLEL UNRECOVERABLE;

The PARALLEL clause directs use of the default parallelism (10) to scan each partition and to sort and build its index. Twenty query server processes are used in total: one set of 10 scans the table while another set of 10 sorts and builds the index.

The UNRECOVERABLE clause specifies that no redo log records are to be written when building the index. Although this speeds up index creation significantly, a media recovery strategy that relies on backups and archived log files will require the DBA to re-issue the CREATE INDEX commands if a disk in TSidx fails after the indexes are created but before they are backed up.

Step 3: Analyzing Data


After the data is loaded and indexed, analyze it. The ANALYZE command does not execute in parallel against a single object. However, many different objects (such as all partitions of a partition view) can be analyzed in parallel.

Note: Cost-based optimization is always used with parallel query and with partition views. You must therefore perform ANALYZE at the partition level with partitioned tables and with parallel query.

Queries with many joins are quite sensitive to the accuracy of the statistics. Use the COMPUTE option of the ANALYZE command if possible (it may take quite some time and a large amount of temporary space). If you must use the ESTIMATE option, sample as large a percentage as possible. Use histograms for data which is not uniformly distributed. Note that a great deal of data falls into this classification.

When you analyze a table, the indexes that are defined on that table are also analyzed. To analyze all partitions of facts (including indexes) in parallel, run the following commands concurrently as background processes:

ANALYZE TABLE fact_1 COMPUTE STATISTICS
ANALYZE TABLE fact_2 COMPUTE STATISTICS
...
ANALYZE TABLE fact_12 COMPUTE STATISTICS

It is worthwhile to compute statistics, or to estimate with a larger sample size, for the indexed columns and the indexes themselves, rather than for the measure data. The measure data is not used as much: most of the predicates and critical optimizer information come from the dimensions. A DBA or application designer should know which columns are used most frequently in predicates.

For example, you might analyze the data in two passes. In the first pass you could obtain some statistics by analyzing one percent of the data. Run the following commands concurrently as background processes:

ANALYZE TABLE fact_1 ESTIMATE STATISTICS SAMPLE 1 PERCENT 
ANALYZE TABLE fact_2 ESTIMATE STATISTICS SAMPLE 1 PERCENT
...
ANALYZE TABLE fact_12 ESTIMATE STATISTICS SAMPLE 1 PERCENT

In a second pass, you could refine statistics for the indexed columns and the index (but not the non-indexed columns):

ANALYZE TABLE fact_1 COMPUTE STATISTICS FOR ALL INDEXED COLUMNS SIZE 1 
ANALYZE TABLE fact_1 COMPUTE STATISTICS FOR ALL INDEXES
ANALYZE TABLE fact_2 COMPUTE STATISTICS FOR ALL INDEXED COLUMNS SIZE 1
ANALYZE TABLE fact_2 COMPUTE STATISTICS FOR ALL INDEXES

...
ANALYZE TABLE fact_12 COMPUTE STATISTICS FOR ALL INDEXED COLUMNS SIZE 1
ANALYZE TABLE fact_12 COMPUTE STATISTICS FOR ALL INDEXES

The result is a better execution plan, because you have targeted the more important information: you spend more resources to get good statistics on the high-value columns (indexes and join columns), and settle for baseline statistics on the rest of the data.

Understanding Parallel Query Performance Issues



The Formula for Memory, Users, and Query Servers

Key to parallel query tuning is an understanding of the relationship between memory requirements, number of users (processes) a system can support, and maximum number of query servers. The goal is to obtain the dramatic performance enhancement made possible by parallelizing certain operations, and by using hash joins rather than sort merge joins. This performance goal must often be balanced with the need to support multiple users.

In considering the maximum number of processes a system can support, it is useful to divide the processes into three classes, based on their memory requirements. The maximum number of processes which can fit in memory can then be analyzed as follows:

  (# low memory processes * low memory required)
+ (# medium memory processes * medium memory required)
+ (# high memory processes * high memory required)
_________________________________
  total memory required

Figure: Formula for Memory/Users/Server Relationship
In general, if max_processes (the maximum number of processes that can fit in memory, from the formula above) is much bigger than the number of users, you can consider running parallel queries. If max_processes is considerably less than the number of users, you must consider other alternatives, such as those described in the following section.

Low Memory Processes: 100K to 1MB
These processes include table scans; index lookups; index nested loop joins; single row aggregates (such as sum or average with no GROUP BYs, or very few groups); sorts that return only a few rows; and direct loading.

This class of DSS process is similar to OLTP processes in the amount of memory required. Process memory could be as low as a few hundred kilobytes of fixed overhead. You could potentially support thousands of users performing this kind of operation. You can take this requirement even lower by using the multi-threaded server, and support even more users.

Medium Memory Processes: 1MB to 10MB

This class of process includes large sorts; sort merge join; GROUP BYs or ORDER BYs returning a large number of rows; and index creation.

These processes require the fixed overhead needed by a low memory process, plus one or more sort areas, depending on the query. For example, a typical sort merge join would sort both its inputs--resulting in two sort areas. Group by or order by with many groups or rows also requires sort areas.

Look at the EXPLAIN PLAN output for the query to identify the number and type of joins, and the number and type of sorts. Optimizer statistics in the plan show the size of the operations. When planning joins, remember that you do have a number of choices.

High Memory Processes: 10MB to 100MB

High memory processes include one or more hash joins; or a combination of one or more hash joins with large sorts.

These processes require the fixed overhead needed by a low memory process, plus one or more hash areas. The hash area size required might range from 8MB to 32MB, and you might need two of them: if you are performing 2 or more serial hash joins, each process uses 2 hash areas. In a parallel query, since each query server process performs at most 1 hash join at a time, you need only 1 hash area per server.

In summary, the amount of hash join memory for a query equals parallel degree multiplied by hash area size, multiplied by the minimum of either 2, or the number of hash joins in the query.
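For example, for a query of degree 10 with two or more hash joins and a 32MB hash area (illustrative figures):

hash join memory = degree * HASH_AREA_SIZE * min(2, number of hash joins)
                 = 10 * 32MB * 2 = 640MB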

See Also: "Optimizing Join Statements" on page A-37 for a comparison of hash joins and sort merge joins.


How to Balance the Formula

You can use the following techniques to balance the memory/user/server formula:


Oversubscribe, with Attention to Paging

You can permit the potential workload to exceed the limits recommended in the formula. Total memory required, minus the SGA size, can be multiplied by a factor of 1.2, to allow for 20% oversubscription. Thus, if you have 1G of memory, you might be able to support 1.2G of demand: the other 20% could be handled by the paging system.

Your system may be able to perform acceptably even if oversubscribed by 60%, if on average not all of the processes are performing hash joins concurrently. Users might then try to use more than the available memory, so you must monitor paging activity in such a situation. If paging goes up dramatically, consider another alternative.

On average, no more than 5% of the time should be spent simply waiting in the operating system on page faults. More wait time than this indicates an I/O bound condition in the paging subsystem. Use your operating system monitor to check wait time: the sum of time waiting and time running equals 100%. If you are running close to 100% CPU, you are not waiting; if you are waiting, it should not be on account of paging.

If wait time for paging devices exceeds 5%, it is a strong indication that you must reduce memory requirements. This could mean reducing the memory required for each class of process, or reducing the number of processes in memory-intensive classes. It could mean adding memory. Or it could indicate an I/O bottleneck in the paging subsystem that you could resolve by striping.

Note: You must verify that a particular degree of oversubscription will be viable on your system by monitoring the paging rate and making sure you are not spending more than a very small percent of the time waiting for the paging subsystem.


Reduce the Number of Memory-intensive Processes

Adjusting the Degree of Parallelism. Not only can you adjust the number of queries that run in parallel, but you can also adjust the degree of parallelism with which queries run. To do this you would issue an ALTER TABLE statement with a PARALLEL clause, or use a hint. See the Oracle7 Server SQL Reference for more information.

You can limit the parallel pool by reducing the value of PARALLEL_MAX_SERVERS. This places a system-level limit on the total amount of parallelism, and is easy to administer. More processes will then be forced to run in serial mode.

Scheduling Parallel Jobs. Rather than reducing parallelism for all operations, you may be able to schedule large parallel batch jobs to run with full parallelism one at a time, rather than concurrently. Queries at the head of the queue would have a fast response time, those at the end of the queue would have a slow response time. Queueing jobs is thus another way to reduce the number of processes but not reduce parallelism; its disadvantage, however, is a certain amount of administrative overhead.


Decrease DSS Memory per Process

Note: The following discussion focuses upon the relationship of HASH_AREA_SIZE to memory, but all the same considerations apply to SORT_AREA_SIZE. The lower bound of SORT_AREA_SIZE, however, is not as critical as the 8 MB recommended minimum HASH_AREA_SIZE.

If every query performs a hash join and a sort, the high memory requirement limits the number of processes you can have. To allow more users to run concurrently you may need to reduce the DSS process memory.

Moving Processes from High to Medium Memory Requirements. You can move a process from the high memory class to the medium memory class by changing from a hash join to a sort merge join. You can use initialization parameters to limit available memory and thus force the optimizer to stay within certain bounds.

You can disable the hash join capability instance-wide and explicitly enable it for the important hash joins you want to run in batch: set HASH_JOIN_ENABLED to FALSE at the instance level, and to TRUE on a per-session basis for those jobs. Conversely, you could set HASH_JOIN_ENABLED to TRUE on a per-instance basis and make it FALSE for particular sessions.
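A minimal sketch of the first approach (instance-wide off, enabled only in the batch session):

HASH_JOIN_ENABLED = FALSE   # init.ora: instance-wide default

ALTER SESSION SET HASH_JOIN_ENABLED = TRUE;   -- in the important batch session only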

Alternatively, you can reduce HASH_AREA_SIZE to well below the recommended minimum (for example, to 1-2MB). Then you can let the optimizer choose sort merge join more often (as opposed to telling the optimizer never to use hash joins). In this way, hash join can still be used for small tables: the optimizer has a memory budget within which it can make decisions about which join method to use.

Remember that the recommended parameter values provide the best response time. If you severely limit these values, you may see a significant effect on response time.

Moving Processes from High or Medium Memory Requirements to Low Memory Requirements. If you need to support thousands of users, you must create access paths such that queries do not touch much data. Decrease the demand for index joins by creating indexes and/or summary tables. Decrease the demand for GROUP BY sorting by creating summary tables and encouraging users and applications to reference summaries rather than detailed data. Decrease the demand for ORDER BY sorts by creating indexes on frequently sorted columns.


Decrease Parallelism for Multiple Users

In general there is a trade-off between parallelism for fast single user response time, and efficient use of resources for multiple users. For example, a system with 2G of memory and a HASH_AREA_SIZE of 32MB can support about 60 query servers. A 10 CPU machine can support up to 3 concurrent parallel queries (2 * 10 * 3 = 60). In order to support 12 concurrent parallel queries, the DBA could override the default parallelism (reduce it); decrease HASH_AREA_SIZE; buy more memory, or use some combination of these three strategies. For example, the DBA could ALTER TABLE t PARALLEL (DEGREE 5) for all parallel tables t, set HASH_AREA_SIZE to 16M, and increase PARALLEL_MAX_SERVERS to 120. By reducing the memory of each query server by a factor of 2, and reducing the parallelism of a single query by a factor 2, the system can accommodate 2*2 = 4 times more concurrent queries.

The penalty for taking such an approach is that when a single query happens to be running, the system will use just half the CPU resource of the 10 CPU machine. The other half will be idle until another query is started.
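A sketch of the combined adjustments described above (repeat the ALTER TABLE for each parallel table t):

HASH_AREA_SIZE = 16777216    # init.ora: 16MB, halved from 32MB
PARALLEL_MAX_SERVERS = 120

ALTER TABLE t PARALLEL (DEGREE 5);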

To determine whether your system is being fully utilized, you can use one of the graphical system monitors which are available on most operating systems. These monitors often give you a better idea of CPU utilization and system performance than monitoring the execution time of a query. Consult your operating system documentation to determine whether your system supports graphical system monitors.


Examples of Balancing Memory, Users, and Query Servers

The examples in this section show how to evaluate the relationship between memory, users, and query servers, and balance the formula given in Figure 18-8. They show concretely how you might adjust your system workload so as to accommodate the necessary number of processes and users.


Example 1

Assume your system has 1G of memory, and that your users perform ad hoc joins with 3 or more tables. If you need 300MB for the SGA, that leaves 700MB to accommodate processes. If you allow a generous hash area size (32MB) for best performance, then you can support one of the following:

Remember that every parallel, hash, or sort merge join query takes a number of query servers equal to twice the degree of parallelism, and often each individual process of a parallel query uses a lot of memory. Thus you can support many more users by having them run serially, or by having them run with less parallelism.

To serve more users, you can drastically reduce the hash area size, to 2MB for example. You may then find that the optimizer switches some queries to sort merge joins. This configuration can support 17 parallel queries, or 170 serial queries, but response times may be significantly higher than with hash joins.

Notice the trade-off above: by reducing memory per process by a factor of 16, you can increase the number of concurrent users by a factor of 16. Thus the amount of physical memory on the machine imposes another limit on total number of parallel queries you can run involving hash joins and sorts.


Example 2

In a mixed workload example, consider the following user population:

In this situation, you would have to make some choices. You could not allow everyone to run hash joins, even though they outperform sort merge joins--because you do not have the memory to support this level of workload.

You might take 20 query servers, and set HASH_AREA_SIZE to a midrange value, perhaps 20MB, for a single powerful batch job in the high memory class. Twenty servers multiplied by 20MB equals 400MB of memory. (This might be a big GROUP BY with join to produce a summary of data.)

You might plan for 10 analysts running sequential queries that use complex hash joins accessing a large amount of data. (You would not allow them to do parallel queries because of memory requirements.) Ten such sequential processes at 40MB apiece equals 400MB of memory.

Finally, to support hundreds of users doing low memory processes at about 0.5MB apiece, you might reserve 200MB.

You might consider it safe to oversubscribe at 50% because of the infrequent batch jobs during the day. This would give you enough virtual memory for the workload described above (700MB * 1.5 = 1.05GB).


Example 3

Suppose there are 200 query servers and 100 users doing heavy DSS involving hash joins. You decide to leave such tasks as index retrievals and small sorts out of the picture, concentrating on the high memory processes. You might have 300 processes, of which 200 must come from the parallel pool and 100 are single threaded. One quarter of the total 2G of memory might be used by the SGA, leaving 1.5 G of memory to handle all the processes. You could apply the formula considering only the high memory requirements, including a factor of 20% oversubscription:

Figure: Formula for Memory/User/Server Relationship: DSS Process Memory
Here, 1.8G divided among 300 processes leaves 6MB per process; after fixed process overhead, roughly 5MB of hash area would be available to each, whereas 8MB is the recommended minimum. If you must have 300 processes, you may need to force them to use other join methods in order to move them from the highly memory-intensive class to the moderately memory-intensive class. Then they may fit within your system's constraints.


Example 4

Consider a system with 2 gigabytes of memory and 10 users who want to run intensive DSS parallel queries concurrently and still have good performance. If you choose parallelism of degree 10, then the 10 users will require 200 processes. (Processes running big joins need twice the number of query servers as the degree of parallelism, so you would set PARALLEL_MAX_SERVERS to 10 * 10 * 2.) In this example each process would get 1.8G/200, or about 9MB of hash area--which should be adequate.

With only 5 users doing large hash joins, each process would get over 16 MB of hash area, which would be fine. But if you want 32 MB available for lots of hash joins, the system could only support 2 or 3 users. By contrast, if users are just computing aggregates the system needs adequate sort area size--and can have many more users.


Example 5

If such a system needs to support 1000 users who must all run big queries you must evaluate the situation carefully. Here, the per user memory budget is only 1.8MB (that is, 1.8G divided by 1,000). Since this figure is at the low end of the medium weight query class, you must rule out parallel query operations, which use even more resources. You must also rule out big hash joins. Each sequential process could require up to 2 hash areas plus sort area, so you would have to set HASH_AREA_SIZE to the same value as SORT_AREA_SIZE, which would be 600K (1.8MB/3). Such a small hash area size is likely to be ineffective, so you may opt to disable hash joins altogether.

Given the organization's resources and business needs, is it reasonable for you to upgrade your system's memory? If memory upgrade is not an option, then you must change your expectations. To adjust the balance you might:


Parallel Query Space Management Issues

This section describes space management issues that come into play when using the parallel query feature.

These issues become particularly important for parallel query running on a parallel server, where tuning becomes more critical as more nodes are involved.


ST Enqueue

Every space management transaction in the database is controlled by a single ST enqueue. A high transaction rate (more than 2 or 3 per minute) on the ST enqueue may result in poor scalability on OPS systems with many nodes, or a timeout waiting for space management resources.


Sorts and Temporary Data

Try to minimize the number of space management transactions, in particular:


Permanent Data in Tables, Indexes, Clusters

Try to minimize the number of sort space management transactions.

Use dedicated temporary tablespaces to optimize space management for sorts. This is particularly beneficial on a parallel server. You can monitor this using V$SORT_SEGMENT.

Set initial and next extent size to a value in the range of 1MB to 10MB. Processes may use temporary space at a rate of up to 1MB per second. Do not accept the default value of 40K for next extent size, because this will result in many requests for space per second.

If you are unable to allocate extents, you can recoalesce the free space by using the ALTER TABLESPACE ... COALESCE command. This should be done on a regular basis for temporary tablespaces in particular.
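For example, assuming the TStemp tablespace from the case study:

ALTER TABLESPACE TStemp COALESCE;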

See Also: "Setting Up Temporary Tablespaces for Parallel Sort and Hash Join" on page 18-22


Optimizing Parallel Query on Oracle Parallel Server


Lock Allocation

This section provides parallel query tuning guidelines for optimal lock management on the parallel server:

To optimize parallel query on a parallel server, you need to set GC_FILES_TO_LOCKS correctly. On a parallel server, a certain number of parallel cache management (PCM) locks are assigned to each data file. By default, DBA (data block address) locking assigns one lock to each block, so during a full table scan a PCM lock must be acquired for each block read into the scan. To speed up full table scans, you have three possibilities:

The following guidelines impact memory usage, and thus indirectly affect performance:

See Also: Oracle7 Parallel Server Concepts & Administration for a thorough discussion of PCM locks and locking parameters.


Allocation of Processes and Instances

Oracle computes a target degree of parallelism before runtime by examining the maximum of the degrees for each table, and other factors. At runtime, a parallel query is executed sequentially if insufficient query servers are available. PARALLEL_MIN_PERCENT sets the minimum percentage of the target number of query servers that must be available for the query to run in parallel; when PARALLEL_MIN_PERCENT is set to n, an error is returned if fewer than n percent of the target query servers are available. If PARALLEL_MIN_PERCENT is not set and no query servers are available, the query executes sequentially.
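For example (the percentage is illustrative):

PARALLEL_MIN_PERCENT = 50   # run in parallel only if at least half the target query servers are available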

The parallel query feature assigns each instance a unique number, which is determined by the INSTANCE_NUMBER initialization parameter. The instance number regulates the order of instance startup.


Load Balancing for Multiple Concurrent Parallel Queries

Load balancing is an effort to distribute the parallel query processes to achieve even CPU and memory utilization, and to minimize remote I/O and communication between nodes.

When multiple concurrent queries are running on a single node, load balancing is done by the operating system. For example, if there are 10 CPUs and 5 query servers, the operating system distributes the 5 processes among the CPUs. If a second user is added, the operating system still distributes the workload.

For a parallel server, however, no single operating system performs the load balancing: instead, the parallel query feature performs this function.

With the following setting, for example, all query servers for the table are allocated on a single instance:

ALTER TABLE tablename PARALLEL (INSTANCES 1)

If a query requests more than one instance, allocation priorities involve table caching and disk affinity.

With the following setting, query servers for the table can be spread across two instances:

ALTER TABLE tablename PARALLEL (INSTANCES 2)

Thus, if there are 5 query servers, it is advantageous for them to run on as many nodes as possible.


Disk Affinity

Some Oracle Parallel Server platforms use disk affinity: processes are allocated on instances that are closest to the requested data. Without disk affinity, Oracle tries to balance the allocation evenly across instances. Thus, with 10 nodes and 2 users, the parallel query feature will run query 1 on the first 5 nodes and query 2 on the second 5 nodes. The two will not overlap.

With disk affinity, Oracle tries to allocate query servers for parallel table scans on the instances which own the data. Disk affinity exploits a "shared nothing" architecture by minimizing data shipping and internode communication. It can significantly increase parallel query throughput and reduce response time.

Disk affinity is used for parallel table scans and parallel temporary tablespace allocation, but is not used for parallel table creation or parallel index creation. Temporary tablespace allocation internally tries to use storage local to the instance, which helps ensure efficient space management extent allocation. Note that operating system striping disables disk affinity, because the data is then no longer localized to particular devices.

In the following example of disk affinity, table T is distributed across 3 nodes, and a full table scan on table T is being performed.

Disk Affinity Example

Detecting Parallel Query Performance Problems


This section summarizes common tools and techniques you can use to obtain performance feedback on parallel queries.


Diagnosing Problems

Use the following decision tree to diagnose parallel query performance problems. Some key issues are the following:

Parallel Query Performance Checklist

Is There Regression?

Does the parallel query's actual performance deviate from what you expected? If performance is as you expected, can you justify the notion that there is a performance problem? Perhaps you have a desired outcome in mind to which you are comparing the current outcome. Perhaps you have a justifiable performance expectation that the system is not achieving: you might have achieved this level of performance or a particular execution plan in the past, but now, with a similar environment and operation, it is not being met.

If performance is not as you expected, can you quantify the deviation? For decision support queries, the execution plan is key. For critical DSS queries, save the EXPLAIN PLAN results. Then, as you reanalyze the data, upgrade Oracle, and load new data over time, you can compare each new execution plan with the old one. You can take this approach either proactively or reactively.
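For instance, you might record each plan under a descriptive statement ID and archive it outside PLAN_TABLE (the plan_history table below is hypothetical, assumed to have been created as an empty copy of PLAN_TABLE):

EXPLAIN PLAN SET STATEMENT_ID = 'Jan_Summary_v1' FOR
SELECT dim_1, SUM(meas_1) FROM facts GROUP BY dim_1;

INSERT INTO plan_history
SELECT * FROM plan_table WHERE statement_id = 'Jan_Summary_v1';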

Alternatively, you may find that you get a better plan if you use hints. In that case, try to understand why the hints were necessary, and determine how to get the optimizer to generate the desired plan without them. Try increasing the statistical sample size: better statistics may give you a better plan. If you had to use a PARALLEL hint, check whether you had OPTIMIZER_PERCENT_PARALLEL set to 100.


Is There a Plan Change?

If there has been a change in the execution plan, determine whether the plan is (or should be) parallel or serial.


Is There a Parallel Plan?

If the execution plan is (or should be) parallel:

See Also: Parallel EXPLAIN PLAN tags are defined in "Values of OTHER_TAG Column of the PLAN_TABLE" on page 20-5.

If the execution plan is (or should be) serial, consider the following strategies:

See Also: "CREATE TABLE ... AS SELECT in Parallel" on page C-4.


Is There Parallel Execution?

If the cause of regression cannot be traced to problems in the plan, then the problem must be an execution issue.

For DSS queries, both serial and parallel, consider memory. Check the paging rate and make sure the system is using memory as effectively as possible. Check buffer, sort, and hash area sizing.


Is There Skew?

If parallel execution is occurring, is there unevenness in workload distribution? For example, if there are 10 CPUs and a single user, you can see whether the workload is evenly distributed across CPUs. This may vary over time, with periods that are more or less I/O intensive, but in general each CPU should have roughly the same amount of activity.

The statistics in V$PQ_TQSTAT show rows produced and consumed per query server process. This is a good indication of skew, and does not require single user operation.

Operating system statistics show per-processor CPU utilization and per-disk I/O activity. Concurrently running tasks make it harder to see what is going on, however, so if possible run in single-user mode and check operating system monitors that show system-level CPU and I/O activity.

When workload distribution is unbalanced, a common culprit is skew in the data. For a hash join, this may be the case if the number of distinct values is less than the degree of parallelism. When joining two tables on a column with only 4 distinct values, you will not get scaling beyond a degree of 4. If you have 10 CPUs, 4 of them will be saturated and 6 will be idle. To avoid this problem, change the query: use temporary tables to change the join order so that every join has more distinct values in its join column than the number of CPUs.
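As a quick check before running the join (the column name is illustrative), compare the number of distinct join values to the intended degree of parallelism:

-- If this count is below the degree of parallelism, a hash join on
-- dim_2 cannot keep all query servers busy.
SELECT COUNT(DISTINCT dim_2) FROM facts;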

If I/O problems occur you may need to reorganize your data, spreading it over more devices. If parallel execution problems occur, check to be sure you have followed the recommendation to spread data over at least as many devices as CPUs.

If there is no skew in workload distribution, check for the following conditions:


Executing Parallel SQL Statements

After analyzing your tables and indexes you should be able to run queries and see speedup that scales linearly with the degree of parallelism used. The following operations should scale:

Start with simple parallel queries. Evaluate total I/O throughput with SELECT COUNT(*) FROM facts. Evaluate total CPU power by adding a complex WHERE clause. I/O imbalance may suggest a better physical database layout. After you understand how simple scans work, add aggregation, joins, and other operations that reflect individual aspects of the overall workload. Look for bottlenecks.
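A minimal sketch of the two probes, using the facts table from the examples in this chapter (the predicate columns are illustrative):

-- Probe I/O throughput: a full scan with almost no CPU work per row
SELECT COUNT(*) FROM facts;

-- Probe CPU power: the same scan with a CPU-intensive predicate
SELECT COUNT(*) FROM facts
WHERE TO_CHAR(dim_2, 'MM') = '01'
  AND SUBSTR(dim_1, 1, 3) = 'ABC';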

Query performance is not the only thing you must monitor. You should also monitor parallel load and parallel index creation, and look for good utilization of I/O and CPU resources.


Using EXPLAIN PLAN to See How a Query is Parallelized

Use an EXPLAIN PLAN statement to view the sequential and parallel query plan. The OTHER_TAG column of the plan table summarizes how each plan step is parallelized, and describes the text of the query that is used by query servers for each operation. Optimizer cost information for selected plan steps is given in the cost, bytes, and cardinality columns. Table 20-1 on page 20-5 summarizes the meaning of the OTHER_TAG column.

EXPLAIN PLAN thus provides detailed information as to where specific operations are being performed. You can then change the execution plan for better performance. For example, if many steps are serial (where OTHER_TAG is blank, serial to parallel, or parallel to serial), then the query controller could be a bottleneck.

Consider the following example SQL statement, which summarizes sales data by region:

EXPLAIN PLAN SET STATEMENT_ID = 'Jan_Summary' FOR
SELECT dim_1, SUM(meas1) FROM facts WHERE dim_2 < '02-01-1995'
GROUP BY dim_1;

The following SQL script extracts a compact hierarchical plan from the output of EXPLAIN PLAN:

SELECT
SUBSTR( lpad(' ',2*(level-1)) ||
decode(id, 0, statement_id, operation) ||
' ' || options || ' ' || object_name ||
' ('|| cardinality|| ',' || bytes ||
' ,'|| cost || ')' || other_tag ,
1, 79) "step (card,bytes,cost) par"
from plan_table
start with id = 0
connect by prior id = parent_id
and prior nvl(statement_id,' ') =
nvl(statement_id,' ');

Following is the query plan for "Jan_Summary":

Jan_Summary   (32921, 5695333, 7309)
SORT GROUP BY (,,) PARALLEL_TO_SERIAL
VIEW facts (834569, 134008732, 6360) PARALLEL_TO_PARALLEL
UNION-ALL PARTITION (,,) PARALLEL_COMBINED_WITH_PARENT
TABLE ACCESS FULL fact_1 (815044,119570335, 6000) PARALLEL_COMBINED_WITH_PARENT
FILTER (,,) PARALLEL_COMBINED_WITH_PARENT
TABLE ACCESS FULL fact_2 (,,) PARALLEL_COMBINED_WITH_PARENT
. . .
FILTER (,,) PARALLEL_COMBINED_WITH_PARENT
TABLE ACCESS FULL fact_12 (,,) PARALLEL_COMBINED_WITH_PARENT

As shown by the PARALLEL_TO_PARALLEL keyword, a parallel partition view shows up as a subtree of operations that are all executed as a unit in each parallel query server. In the preceding example, the union-all partition and full scans for all partitions execute as a unit in the same set of parallel processes.

Each parallel query server potentially scans a portion of each partition, rather than each server scanning all of one partition. This provides better load balancing. Some of the full scans are skipped, as shown by the presence of filter nodes.

The following figure illustrates how, with the PARALLEL_TO_PARALLEL keyword, data from the partition view is redistributed to a second set of parallel query servers for parallel grouping. Query server set 1 executes the partition view and its subtree of operations from the preceding example. The rows coming out are repartitioned through the table queue to query server set 2, which executes the group by operation. Because the GROUP BY operation indicates PARALLEL_TO_SERIAL, another table queue collects its results and sends it to the query coordinator, and then to the user.

Data Redistribution among Parallel Query Servers
As a rule, if the PARALLEL_TO_PARALLEL keyword is present, there will be two sets of query servers. This means that for grouping, sort merge, or hash join operations, twice the number of parallel query servers will be assigned to the query, and rows must be redistributed from set 1 to set 2. If there is no PARALLEL_TO_PARALLEL keyword, the query gets just one set of servers. Operations needing only one set include simple aggregations, such as SELECT COUNT(*) FROM facts, and simple selections, such as SELECT * FROM facts WHERE dim_2 = '01-07-1994'.

For non-distributed queries, the OBJECT_NODE column gives the name of the table queue. If the PARALLEL_TO_PARALLEL keyword exists, then the EXPLAIN PLAN of the parent operation should have SQL that references the child table queue in its FROM clause. In this way it describes the order in which the output from operations is consumed.

See Also: For more information, see Chapter 20, "The EXPLAIN PLAN Command".


Dynamic Performance Tables

Dynamic performance tables are views of internal Oracle7 data structures and statistics that you can query periodically to monitor progress of a long-running operation. When used in conjunction with data dictionary views, these tables provide a wealth of information. The challenge is visualizing the data and then acting upon it.


V$FILESTAT

This view sums read and write requests, number of blocks, and service times for every datafile in every tablespace. It can help you diagnose I/O problems and workload distribution problems.

The file numbers listed in V$FILESTAT can be joined to those in the DBA_DATA_FILES view to group I/O by tablespace or to find the file name for a given file number. By doing ratio analysis you can determine the percentage of total tablespace activity accounted for by each file in the tablespace. If you make a practice of putting just one large, heavily accessed object in a tablespace, you can use this technique to identify objects that have a poor physical layout.
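A sketch of the join; PHYRDS and PHYWRTS are the physical read and write counts:

SELECT d.tablespace_name, d.file_name, f.phyrds, f.phywrts
FROM v$filestat f, dba_data_files d
WHERE f.file# = d.file_id
ORDER BY d.tablespace_name, f.phyrds DESC;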

You can further diagnose disk space allocation problems using the DBA_EXTENTS view. Ensure that space is allocated evenly from all files in the tablespace. Monitoring V$FILESTAT during a long running operation and correlating I/O activity to the explain plan output is a good way to follow progress.


V$PARAMETER

This view lists the name, current value, and default value of all system parameters. In addition, the view indicates whether the parameter may be modified online with an ALTER SYSTEM or ALTER SESSION command.
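For example, to check which parallel-related parameters can be changed dynamically (ISSES_MODIFIABLE and ISSYS_MODIFIABLE indicate session- and system-level modifiability):

SELECT name, value, isdefault, isses_modifiable, issys_modifiable
FROM v$parameter
WHERE name LIKE '%parallel%';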


V$PQ_SESSTAT

This view is valid only when queried from a session that is executing parallel SQL statements. Thus it cannot be used to monitor a long running operation. It gives summary statistics about the parallel statements executed in the session, including total number of messages exchanged between server processes and the actual degree of parallelism used.
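From the session that ran the parallel statements, a minimal query:

SELECT statistic, last_query, session_total
FROM v$pq_sesstat;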


V$PQ_SLAVE

This view tallies the current and total CPU time and number of messages sent and received per query server process. It can be monitored during a long running operation. Verify that there is little variance among processes in CPU usage and number of messages processed. A variance may indicate a load balancing problem. Attempt to correlate the variance to a variance in the base data distribution. Extreme imbalance could indicate that the number of distinct values in a join column is much less than the degree of parallelism. See "Parallel CREATE TABLE AS SELECT" on page C-4 for a possible workaround.
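A sketch of a load-balance check across query server processes:

SELECT slave_name, cpu_secs_total, msgs_sent_total, msgs_rcvd_total
FROM v$pq_slave
ORDER BY slave_name;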


V$PQ_SYSSTAT

The V$PQ_SYSSTAT view aggregates session statistics from all query server processes. It sums the total query server message traffic, and gives the status of the pool of query servers.

This view can help you determine the appropriate number of query server processes for an instance. The statistics that are particularly useful are "Servers Busy", "Servers Idle", "Servers Started", and "Servers Shutdown".

Periodically examine V$PQ_SYSSTAT to determine if the query servers for the instance are actually busy. To determine whether the instance's query servers are active, issue the following query:

SELECT * FROM V$PQ_SYSSTAT
WHERE statistic = 'Servers Busy';

STATISTIC             VALUE
--------------------- -----------
Servers Busy          70


V$PQ_TQSTAT

This view provides a detailed report of message traffic at the level of the table queue. It is valid only when queried from a session that is executing parallel SQL statements. A table queue is the pipeline between query server groups or between the query coordinator and a query server group or between a query server group and the coordinator. These table queues are represented in the query plan by the tags PARALLEL_TO_PARALLEL, SERIAL_TO_PARALLEL, or PARALLEL_TO_SERIAL, respectively.

The view contains a row for each query server process that reads or writes each table queue. A table queue connecting 10 consumers to 10 producers will have 20 rows in the view. Sum the bytes column and group by TQ_ID for the total number of bytes sent through each table queue. Compare this with the optimizer estimates; large variations may indicate a need to analyze the data using a larger sample.

Compute the variance of bytes grouped by TQ_ID. Large variances indicate workload imbalance. Drill down on large variances to determine if the producers start out with unequal distributions of data, or whether the distribution itself is skewed. The latter may indicate a low number of distinct values.
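A sketch of both checks, grouping by table queue and producer/consumer role:

SELECT tq_id, server_type,
       SUM(bytes)    total_bytes,
       STDDEV(bytes) bytes_stddev  -- large values suggest imbalance
FROM v$pq_tqstat
GROUP BY tq_id, server_type;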

For many of the dynamic performance tables, the system parameter TIMED_STATISTICS must be set to TRUE in order to get the most useful information. You can use ALTER SYSTEM to turn TIMED_STATISTICS on and off dynamically.
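For example:

ALTER SYSTEM SET TIMED_STATISTICS = TRUE;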

See Also: For more information, see Chapter 19, "The Dynamic Performance Tables".


Operating System Statistics

There is considerable overlap between information available in Oracle and information available through operating system utilities (such as sar and vmstat on UNIX-based systems). Operating systems provide performance statistics on I/O, communication, CPU, memory and paging, scheduling, and synchronization primitives. The Oracle V$SESSTAT view provides the major categories of OS statistics as well.

Typically it is harder to map OS information about I/O devices and semaphore operations back to database objects and operations; on the other hand the OS may have better visualization tools and more efficient means of collecting the data.

OS information about CPU and memory usage is very important for assessing performance. Probably the most important statistic is CPU usage. The goal of low level performance tuning is to become CPU bound on all CPUs. Once this is achieved, you can pop up a level and work at the SQL level to find an alternate plan that is perhaps more I/O intensive but uses less CPU.

OS memory and paging information is valuable for fine tuning the many system parameters that control how memory is divided among memory-hungry warehouse subsystems like parallel communication, sort, and hash join.

Solving Parallel Query Performance Problems



Overriding the Default Degree of Parallelism

The default degree of parallelism is appropriate for reducing response time while guaranteeing use of CPU and I/O resources for any arbitrary query. If an operation is I/O bound, you should consider increasing the default degree of parallelism. If it is memory bound, or there are several concurrent parallel queries, consider decreasing the default degree.

Note: The default degree of parallelism is used for tables that have PARALLEL attributed to them in the data dictionary, or via the PARALLEL hint. If a table does not have parallelism attributed to it, or has NOPARALLEL (the default) attributed to it, then that table is never scanned in parallel--regardless of the default degree of parallelism that would be indicated by the number of CPUs, instances, and devices storing that table.

To override the default degree of parallelism:

  1. Determine the maximum number of query servers your system can support.

  2. Divide the query servers among the estimated number of concurrent queries.
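For example (the table name and degree are illustrative), you can attribute a degree of parallelism to a table, or request one for a single query with a hint:

ALTER TABLE facts PARALLEL (DEGREE 8);

SELECT /*+ FULL(f) PARALLEL(f, 8) */ COUNT(*) FROM facts f;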

Note the following general guidelines:


Rewriting the SQL

The most important issue for parallel query execution is ensuring that all parts of the query plan that process a substantial amount of data execute in parallel. Use EXPLAIN PLAN and verify that all plan steps have an OTHER_TAG of PARALLEL_TO_PARALLEL, PARALLEL_TO_SERIAL, PARALLEL_COMBINED_WITH_PARENT, or PARALLEL_COMBINED_WITH_CHILD. Any other keyword (or a null tag) indicates serial execution and a possible bottleneck.

By making the following changes you can increase the optimizer's ability to generate parallel plans:


Creating and Populating Tables in Parallel

Oracle cannot return results to a user process in parallel. If a query returns a large number of rows, execution of the query may indeed be faster; however, the user process can only receive the rows serially. To optimize parallel query performance with queries that retrieve large result sets, use PARALLEL CREATE TABLE AS SELECT to store the result set in the database. At a later time, users can view the result set in serial.

To avoid I/O bottlenecks, specify a tablespace with at least as many devices as CPUs. To avoid fragmentation in allocating space, the number of files in a tablespace should be a multiple of the number of CPUs.
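A sketch for a four-CPU system, with one datafile per device (the tablespace name, paths, and sizes are illustrative):

CREATE TABLESPACE results
    DATAFILE '/dev1/results01.dbf' SIZE 250M,
             '/dev2/results02.dbf' SIZE 250M,
             '/dev3/results03.dbf' SIZE 250M,
             '/dev4/results04.dbf' SIZE 250M;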


Creating Temporary Tables in Parallel

When combined with the UNRECOVERABLE option in Oracle7, the parallel version of the CREATE TABLE AS SELECT statement provides a very efficient temporary table facility. For example:

CREATE TABLE summary PARALLEL UNRECOVERABLE
AS SELECT dim_1, dim_2, ..., SUM(meas_1) FROM facts
GROUP BY dim_1, dim_2;

You can take advantage of intermediate tables using the following techniques:


Creating Indexes in Parallel

Multiple processes can work together simultaneously to create an index. By dividing the work necessary to create an index among multiple server processes, the Oracle Server can create the index more quickly than if a single server process created the index sequentially.

Parallel index creation works in much the same way as a table scan with an ORDER BY clause. The table is randomly sampled and a set of index keys is found that equally divides the index into the same number of pieces as the degree of parallelism. A first set of query processes scans the table, extracts (key, ROWID) pairs, and sends each pair to a process in a second set of query processes based on key. Each process in the second set sorts the keys and builds an index in the usual fashion. After all index pieces are built, the coordinator simply concatenates the pieces (which are ordered) to form the final index.

You can optionally specify that no redo logging should occur during index creation. This can significantly improve performance, but temporarily renders the index unrecoverable. Recoverability is restored after the new index is backed up. If your application can tolerate this window where recovery of the index requires it to be recreated, then you should consider using the UNRECOVERABLE option.

By default, the Oracle Server uses the table definition's PARALLEL clause value to determine the number of server processes to use when creating an index. You can override the default number of processes by using the PARALLEL clause in the CREATE INDEX command.
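A sketch combining an explicit degree with the UNRECOVERABLE option discussed above (the index name, column, and degree are illustrative):

CREATE INDEX facts_dim1 ON facts(dim_1)
    PARALLEL (DEGREE 8)
    UNRECOVERABLE;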

Attention: When creating an index in parallel, the STORAGE clause refers to the storage of each of the subindexes created by the query server processes. Therefore, an index created with an INITIAL of 5MB and a PARALLEL DEGREE of 12 consumes at least 60MB of storage during index creation because each process starts with an extent of 5MB. When the query coordinator process combines the sorted subindexes, some of the extents may be trimmed, and the resulting index may be smaller than the requested 60MB.

When you add or enable a UNIQUE key or PRIMARY KEY constraint on a table, you cannot automatically create the required index in parallel. Instead, manually create an index on the desired columns using the CREATE INDEX command and an appropriate PARALLEL clause and then add or enable the constraint. Oracle then uses the existing index when enabling or adding the constraint.
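For example (the constraint and column names are hypothetical), create the index in parallel first, then add the constraint, which uses the existing index:

CREATE UNIQUE INDEX facts_pk ON facts(fact_id)
    PARALLEL (DEGREE 8);

ALTER TABLE facts ADD CONSTRAINT facts_pk PRIMARY KEY (fact_id);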

See Also: For more information on how extents are allocated when using the parallel query feature, see Oracle7 Server Concepts.

Refer to the Oracle7 Server SQL Reference for the complete syntax of the CREATE INDEX command.


Using Direct Disk I/O for Sorts with Parallel Query

Setting SORT_DIRECT_WRITES to AUTO causes each process performing large sorts to write its own sort runs to disk. When SORT_DIRECT_WRITES is FALSE, the DBWR process writes sort runs to disk via the buffer cache. Using AUTO provides better scalability for large parallel sorts because it removes contention for shared buffers.
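Assuming SORT_DIRECT_WRITES is session-modifiable on your release (check V$PARAMETER), a minimal sketch:

ALTER SESSION SET SORT_DIRECT_WRITES = AUTO;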


Using Hints with the Cost Based Optimization Approach

Cost-based optimization is a highly sophisticated approach to finding the best execution plan for SQL statements. Cost-based optimization is automatically used with parallel query.

Attention: You must use ANALYZE to gather current statistics for cost-based optimization. In particular, tables used in parallel should always be analyzed.
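For example, estimated statistics with a moderate sample are usually sufficient:

ANALYZE TABLE facts ESTIMATE STATISTICS SAMPLE 10 PERCENT;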

Use discretion in employing hints. With the parallel-aware optimizer, cost-based optimization generates such good plans that hints should rarely be necessary. If used at all, hints should come as a final step in tuning, and only when they demonstrate a necessary and significant performance advantage. In such cases, begin with the execution plan recommended by cost-based optimization, and test the effect of hints only after you have quantified your performance expectations.

Remember that hints are powerful; if you use them and the underlying data changes you may need to change the hints. Otherwise, the effectiveness of your execution plans may deteriorate.

Always use cost-based optimization unless you have an existing application that has been hand-tuned for rule-based optimization. If you must use rule-based optimization, rewriting a SQL statement can give orders of magnitude improvements.

See Also: "OPTIMIZER_PERCENT_PARALLEL" on page 18-5. This parameter controls parallel awareness.



Copyright © 1996 Oracle Corporation.
All Rights Reserved.