
Successfully Implementing Parallel Processing
There is an old network saying: Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed--you can't bribe God.
David Clark, MIT
To attain the goals of speedup and scaleup, you must effectively implement parallel processing and parallel database technology. This means designing and building your system for parallel processing from the start. This chapter covers the following issues:
- The Four Levels of Scalability You Need
- When Is Parallel Processing Advantageous?
- When Is Parallel Processing Not Advantageous?
- Guidelines for Effective Partitioning
- Common Misconceptions about Parallel Processing
The Four Levels of Scalability You Need
Successful implementation of parallel processing and parallel database requires optimal scalability on four levels:
- hardware
- operating system and DLM
- database
- application
Attention: An inappropriately designed application may not fully utilize the potential scalability of the system. Likewise, no matter how well your application scales, you will not get the desired performance if you try to run it on hardware that does not scale.
Figure 2 - 1. Levels of Scalability
Scalability of Hardware
Interconnect is key to hardware scalability. Every system must have some means of connecting the CPUs, whether it be a high-speed bus or a low-speed Ethernet connection. The bandwidth and latency of the interconnect determine the scalability of the hardware.
Bandwidth and Latency
Most interconnects have sufficient bandwidth. A high bandwidth may, in fact, disguise high latency.
Hardware scalability depends heavily on very low latency: Oracle Parallel Server (OPS) communications are characterized by a large number of very small messages to and from the distributed lock manager (DLM).
Consider the difference between conveying a hundred passengers in a single bus and conveying them in a hundred individual cars. In the latter case, efficiency depends largely upon how quickly cars can enter and exit the highway.
Disk Input and Output
Local I/Os are faster than remote I/Os (those which occur between nodes). If a great deal of remote I/O is needed, the system loses scalability. In this case you can partition the data so that it is local. The following example illustrates the difference:
Figure 2 - 2. Local and Remote I/O on Shared Nothing and Shared Disk
Various clustering implementations are available from different hardware vendors. Note that on shared disk clusters with dual-ported controllers, latency is the same from all nodes.
Scalability of Operating System and DLM
The ultimate scalability of your system also depends upon the scalability of the operating system and the DLM. This section explains how to analyze these factors.
Operating System Scalability
Software scalability can be an important issue if one node is a shared memory system (that is, a symmetric multiprocessor in which multiple CPUs share a single memory). Methods of synchronization in the operating system can determine the scalability of the system. In asymmetric multiprocessing, for example, only a single CPU can handle I/O interrupts. Consider a system in which multiple user processes all need to request a resource from the operating system:
Figure 2 - 3. Asymmetric Multiprocessing vs. Symmetric Multiprocessing
Here, the potential scalability of the hardware is lost because the operating system can process only one resource request at a time. Each time a request enters the operating system, a lock is held to exclude the others. In symmetric multiprocessing, by contrast, there is no such bottleneck.
DLM Scalability
The vendor-supplied distributed lock manager, which may be closely tied to or built into the operating system, also controls scalability. Evaluate the different DLMs using criteria which will affect your ultimate success in parallel processing, in particular how each handles concurrency and communication over the interconnect:
- How many requests can the DLM handle concurrently? Some DLMs can handle only one; others allow concurrent access to different resources.
- How many local lock requests can be handled concurrently? How many remote requests? Some DLMs can handle many local requests, but only one remote request concurrently.
- What communications protocol is used on the interconnect?
Refer to your vendor's DLM guide to learn more about its capabilities and limitations.
Scalability of Database
An important distinction in parallel server architectures is internal versus external parallelism, which has a strong effect on scalability. The key difference is whether the query is parallelized by the DBMS itself or by an external process.
See Also: "The Parallel Query Option"
for an example of internal versus external parallelism.
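As a brief sketch of internal parallelism (the ORDERS table and the degree of parallelism are assumptions for illustration, not from this chapter), Oracle's PARALLEL hint asks the DBMS itself to divide a query among multiple query server processes:
-- Hypothetical example: the hint asks the DBMS itself to split
-- the scan of ORDERS across four query server processes; no
-- external process decomposes the query.
SELECT /*+ PARALLEL(ORDERS, 4) */ COUNT(*)
FROM ORDERS;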
Scalability of Application
Application design is key to taking advantage of the scalability of the other elements of the system.
Attention: Applications must be specifically designed to be scalable!
No matter how scalable the hardware, software, and database may be, a table with only one row which every node is updating will synchronize on one data block. Consider the process of generating a unique sequence number:
UPDATE ORDER_NUM
SET NEXT_ORDER_NUM = NEXT_ORDER_NUM + 1;  -- every instance must update this single row
COMMIT;
Every node which needs to update this sequence number must wait to access the same row of this table: the situation is inherently unscalable. A better approach is to use Oracle sequences to improve scalability:
INSERT INTO ORDERS VALUES
(order_sequence.nextval, ... )
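How the sequence is defined also matters. The following is a minimal sketch, assuming the sequence name from the example above and an arbitrary cache size; with a cache, each instance can allocate its own range of numbers without synchronizing on a shared data block:
-- Hypothetical definition of the sequence used above. The CACHE
-- clause lets each instance hold a range of numbers in memory,
-- so instances do not contend for a single data block.
CREATE SEQUENCE order_sequence
  START WITH 1
  INCREMENT BY 1
  CACHE 100;
Note that cached sequence values are not guaranteed to be issued in strict order across instances; this approach trades ordering for scalability.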
Note: Clients must be connected to server machines in a scalable manner: this means that your network must also be scalable!
See Also: "Designing a Database for Parallel Server"
.
"Application Analysis"
.
When Is Parallel Processing Advantageous?
This section describes applications which commonly benefit from a parallel server.
Data Warehousing Applications
Data warehousing applications which infrequently update, insert, or delete data are often appropriate for the parallel server. Query-intensive applications and other applications with low update activity can access the database through different instances with little additional overhead.
If the data blocks are not modified, multiple copies of the blocks can be read into the Oracle buffer caches on several nodes and queried without additional I/O or lock operations. As long as the instances are only reading data and not modifying it, a block can be read into multiple buffer caches and one instance never has to write the block to disk before another instance can read it.
Decision support applications are good candidates for a parallel server because they only occasionally modify data, as in a database of financial transactions which is mostly accessed by queries during the day and is updated during off-peak hours.
Applications in Which Updated Data Blocks Do Not Overlap
Applications which either update disjoint data blocks or update the same data blocks at different times are also well suited to the parallel server. Applications can run efficiently on a parallel server if the set of data blocks regularly updated by one instance does not overlap with the set of blocks simultaneously updated by other instances. An example is a time-sharing environment where each user primarily owns and uses one set of tables.
An instance which needs to update blocks held in its buffer cache must hold one or more distributed locks in exclusive mode while modifying those buffers. You should tune a parallel server, and the applications which run on it, to reduce contention for instance locks.
OLTP with Partitioned Data
Online transaction processing applications which modify disjoint sets of data benefit the most from the parallel server architecture. One example is a branch banking system where each branch (node) accesses its own accounts and only occasionally accesses accounts from other branches.
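As an illustration only (the ACCOUNTS table, its columns, and the branch numbering are hypothetical), a branch application might confine its transactions to rows owned by its branch:
-- Hypothetical branch-local transaction: the instance serving
-- branch 42 updates only rows belonging to that branch, so the
-- blocks it modifies rarely overlap with other instances' blocks.
UPDATE ACCOUNTS
SET BALANCE = BALANCE - 100
WHERE ACCOUNT_ID = 1001
AND BRANCH_ID = 42;
COMMIT;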
OLTP with Random Access to a Large Database
Applications which access a database in a mostly random pattern also benefit from the parallel server architecture, if the database is significantly larger than any node's buffer cache. One example is a Department of Motor Vehicles system in which individual records are unlikely to be accessed by different nodes at the same time. Another example would be archived tax records or research data. In cases like these, most accesses would result in I/O even if the instance had exclusive access to the database. Oracle features such as fine-grained locking can further improve the performance of such applications.
Departmentalized Applications
Applications which primarily modify different tables in the same database are also suitable for OPS. An example is a system where one node is dedicated to inventory processing, another is dedicated to personnel processing, and a third is dedicated to sales processing. Note that there is only one database to administer, not three.
Summary
The figure below illustrates the relative scalability of different kinds of applications. Online transaction processing applications which have a very high volume of inserts or updates from multiple nodes on the same set of data may require partitioning if they are to scale well. OLTP applications with a very low insert and update load may not require partitioning at all to be successful.
Figure 2 - 4. Scalability of Applications
When Is Parallel Processing Not Advantageous?
The following guidelines describe situations in which parallel processing is not advantageous.
- Parallel processing is not advantageous if many users on a large number of nodes are modifying a small set of data: synchronization is likely to be very high. (If they are only reading the data, no synchronization is required.)
- Parallel processing is not advantageous when there is contention between instances on a single block or row.
For example, it would not be effective to use a one-row table primarily as a sequence numbering tool. Such a table would become a bottleneck because every process would have to select the row, update it, and release it sequentially.
Guidelines for Effective Partitioning
This section provides general guidelines for making partitioning decisions which will decrease synchronization and improve your system's performance.
Overview
You can partition any of the three elements of processing, depending on function, location, and so on, such that they do not interfere with each other. These elements are:
- users
- applications
- data
You can partition data based on the groups of users who access it, and partition applications into groups which access the same data. You can also consider partitioning by location (geographic partitioning).
Vertical Partitioning
With vertical partitioning, a large number of tasks can run on a large number of resources without much synchronization.
The following figure illustrates the concept of vertical partitioning.
Figure 2 - 5. Vertical Partitioning
Here, a company's accounts payable and accounts receivable functions have been partitioned by users, application, and data, and placed on two separate nodes. Most of the synchronization then takes place on the same node, which is very efficient: synchronization on the local node costs much less than synchronization between nodes.
Partition tasks onto a subset of resources to reduce synchronization: after partitioning, a smaller set of tasks works on a smaller set of resources.
Horizontal Partitioning
The following example illustrates the concept of horizontal partitioning.
Figure 2 - 6 represents the rows of a stock table. If the Oracle Parallel Server has four instances, the data can be partitioned such that each instance accesses only a subset of the rows.
Figure 2 - 6. Horizontal Partitioning
In this example, very little synchronization is necessary because the instances access different sets of rows.
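As a sketch (the STOCKS table and the alphabetic ranges are assumptions for illustration), each instance would issue queries restricted to its own range of rows:
-- Hypothetical horizontal partition: instance 1 works on one
-- alphabetic range of the stock table, instance 2 on the next,
-- and so on, so their sets of data blocks rarely overlap.
SELECT SYMBOL, PRICE
FROM STOCKS
WHERE SYMBOL BETWEEN 'A' AND 'F';  -- range assigned to instance 1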
Similarly, users partitioned by location can often run almost independently: very little synchronization is necessary if the users do not access the same data.
Common Misconceptions about Parallel Processing
Various mistaken notions can lead to unrealistic expectations about parallel processing. Consider the following:
- Do not assume that you can simply switch to parallel processing and it will automatically work the way you expect. A good deal of application tuning, as well as database design and tuning, is required.
- Scalability is not determined just by the number of nodes or CPUs involved, but also by interconnect (bandwidth/latency) and by the amount and cost of synchronization.
In some applications a single synchronization may be so expensive as to constitute a problem; in other applications, many cheap synchronizations may be perfectly acceptable.
- Availability also differs across parallel architectures. On some MPP systems, for example, if one of the CPUs dies, the whole machine dies; on a cluster, by contrast, if one of the nodes dies the other nodes survive.
- Remember that not all applications have been designed to scale up effectively.