POWER5 Architecture
Added by Nicolette McFadden, last edited by Nicolette McFadden on Feb 14, 2006  (view change)

This section discusses the POWER5 processor architecture and how it differs from its predecessor, the POWER4 processor.

The POWER5 chip

The POWER5 chip features single and simultaneous multi-threaded execution, providing higher performance in the single-threaded mode than its POWER4 predecessor at equivalent frequencies. The POWER5 processor maintains both binary and architectural compatibility with existing POWER4 processor-based systems and is designed to ensure that binaries continue executing properly and application optimizations carry forward to newer systems.

The POWER5 design provides additional enhancements such as virtualization, reliability, availability, and serviceability (RAS) features at both chip and system levels.

Key enhancements introduced into the POWER5 processor and system design points include:

  • Designed for entry and high-end servers
  • Simultaneous multi-threading
  • Dynamic resource balancing to efficiently allocate system resources to each thread
  • Software-controlled thread prioritization
  • Dynamic power management to reduce power consumption without affecting performance
  • Micro-Partitioning technology (hardware support for Shared Processor Partitions)
  • Virtual storage, virtual Ethernet
  • Enhanced scalability, parallelism
  • Enhanced memory subsystem
  • Improved performance
  • Compatibility with existing POWER4 systems
  • Enhanced reliability, availability, serviceability

Chip overview

Featuring single- and multithreaded execution, the POWER5 provides higher performance in the single-threaded mode than its POWER4 predecessor at equivalent frequencies. Enhancements include dynamic resource balancing to efficiently allocate system resources to each thread, software-controlled thread prioritization, and dynamic power management to reduce power consumption without affecting performance.

The POWER5 processor supports the 64-bit PowerPC architecture. A single die contains two identical processor cores, each supporting two logical threads. This architecture makes the chip appear as a four-way symmetric multiprocessor to the operating system.

Each processor core has a separate 64 KB level-one (L1) instruction cache and a 32 KB L1 data cache. The L1 caches are shared by the two hardware threads of the processor core. Both processor cores in a chip share a 1.88 MB unified level-two (L2) cache. The processor chip houses a level-three (L3) cache controller, which provides an L3 cache directory on the chip. However, the L3 cache itself is on a separate Merged Logic DRAM (MLD) cache chip. The L3 is a 36 MB victim cache of the L2 cache. The L2 and L3 caches are shared by all the hardware threads of both processor cores on the chip.

Unlike POWER4, which was specifically aimed at high-end server applications, design features of POWER5 are targeted at a broad range of applications, from low-end one- and two-way servers to high-end 64-way super-servers.

SMPLink is a very low latency switchless interconnect technology that allows nodes to be interconnected as flat SMPs. The actual SMPLink ports come directly off the POWER5 chip. When connected, the SMPLinks provide a direct path between each POWER5 chip.

With the introduction of SMT, more instructions execute per cycle per processor core, thus increasing the core's and the chip's total switching power.

POWER5 was designed to maintain both binary and structural compatibility with existing POWER4 systems to ensure that binaries continue executing properly and all application optimizations carry forward to newer systems.

The rest of the improvements and new features, such as enhancements to the memory subsystem and SMT, are discussed on later charts.

Differences and changes between the POWER4 and POWER5 processor

This chart shows the layout of the POWER4 and POWER5 chips.

Feature                       POWER4 design                      POWER5 design
L1 data cache                 2-way set associative, FIFO (a)    4-way set associative, LRU (b)
L2 cache                      8-way set associative, 1.44 MB     10-way set associative, 1.9 MB
L3 cache                      32 MB (118 clock cycles)           36 MB (~80 clock cycles)
Memory bandwidth              4 GB/second per chip               ~16 GB/second per chip
Simultaneous multi-threading  No                                 Yes
Processor addressing          1 processor                        1/10th of processor
Dynamic power management      No                                 Yes
Size                          412 mm²                            389 mm²

a. FIFO stands for First In First Out
b. LRU stands for Least Recently Used

Type      Size      Heat consumption                                    # transistors
POWER4    415 mm²   115 W @ 1.1 GHz, 156 W @ 1.3 GHz                    174M
POWER4+   267 mm²   75 W @ 1.2 GHz, 95 W @ 1.45 GHz, 125 W @ 1.7 GHz    184M
POWER5    389 mm²   167 W @ 1.65 GHz                                    276M

Enhanced memory subsystem

  • Improved L1 cache design
    • 2-way set associative i-cache
    • 4-way set associative d-cache
    • New replacement algorithm (LRU vs. FIFO)
  • Larger L2 cache
    • 1.9 MB, 10-way set associative
  • Improved L3 cache design
    • 36 MB, 12-way set associative
    • L3 on the processor side of the fabric
    • Satisfies L2 cache misses more frequently
    • Avoids traffic on the interchip fabric
  • On-chip L3 directory and memory controller
    • L3 directory on the chip reduces off-chip delays after an L2 miss
    • Reduced memory latencies
  • Improved pre-fetch algorithms

The L1 instruction cache is 2-way set associative with a Least Recently Used (LRU) replacement policy, and it is kept coherent with the L2 cache. The L1 data cache is 4-way set associative, also with an LRU replacement policy. The L1 data cache is a store-through design, so it never holds modified data.
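The difference between the LRU policy used in POWER5 and the FIFO policy of the POWER4 L1 data cache can be illustrated with a small simulation of a single cache set. This is only a sketch: the function names, the two-way set, and the access pattern are illustrative assumptions, not taken from the hardware.

```python
from collections import OrderedDict, deque

def count_hits_lru(accesses, ways):
    """Count hits in one cache set that uses LRU replacement."""
    lines = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for tag in accesses:
        if tag in lines:
            hits += 1
            lines.move_to_end(tag)         # a hit refreshes the line's recency
        else:
            if len(lines) == ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[tag] = True
    return hits

def count_hits_fifo(accesses, ways):
    """Count hits in one cache set that uses FIFO replacement."""
    order = deque()   # oldest-inserted line sits at the left
    resident = set()
    hits = 0
    for tag in accesses:
        if tag in resident:
            hits += 1                      # a hit does not change FIFO order
        else:
            if len(order) == ways:
                resident.discard(order.popleft())  # evict the oldest line
            order.append(tag)
            resident.add(tag)
    return hits

# Hypothetical access pattern: line "A" is reused between other lines.
pattern = ["A", "B", "A", "C", "A", "D", "A", "E", "A"]
print(count_hits_lru(pattern, ways=2), count_hits_fifo(pattern, ways=2))  # 4 2
```

In this made-up pattern, LRU keeps the frequently reused line resident, while FIFO evicts it as soon as it becomes the oldest entry.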

The POWER5 L2 cache is accessed by both cores of the chip. It maintains full hardware coherence within the system and can supply intervention data to cores on other POWER5 chips. L2 is an in-line cache, unlike L1s, which are store-through. It is fully inclusive of the two L1 data caches and L1 instruction caches (one L1 data and instruction cache per core).

The 1.88 MB (1,920 KB) L2 is physically implemented in three slices, each 640 KB in size. Each of these three slices has a separate L2 cache controller. Either processor core of the chip can independently access each L2 controller. The L2 slices are 10-way set associative. 10-way set associativity (vs. 8-way on POWER4) helps to reduce cache contention by allowing more potential storage locations for a given cache line.
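The slice arithmetic above can be checked with a few lines of Python. The constants come from the text; the helper names are illustrative, and the conversion assumes 1 MB = 1,024 KB.

```python
SLICE_KB = 640   # size of one L2 slice, from the text
NUM_SLICES = 3   # number of L2 slices, from the text
WAYS = 10        # set associativity of each slice, from the text

def l2_total_kb(slice_kb=SLICE_KB, slices=NUM_SLICES):
    """Total L2 capacity in KB across all slices."""
    return slice_kb * slices

def kb_to_mb(kb):
    """Convert KB to MB using 1 MB = 1024 KB."""
    return kb / 1024

total_kb = l2_total_kb()
print(total_kb, kb_to_mb(total_kb))  # 1920 1.875
print(SLICE_KB / WAYS)               # 64.0 (KB per way in each slice)
```

The exact value of 1.875 MB is what this page rounds to 1.88 MB or 1.9 MB in other places.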

L3 is a unified 36 MB cache accessed by both cores on the POWER5 processor chip. It maintains full hardware coherence with the system and can supply intervention data to cores on other POWER5 processor chips. Logically, L3 is an inline cache. In practice, L3 is a victim cache of the L2: all valid cache lines evicted from the L2 because of associativity conflicts (victimized) are cast out to L3. The L3 is not inclusive of L2; the same line never resides in both L2 and L3 at the same time.

The L3 cache is implemented off-chip as a separate MLD cache chip, but its directory is on the processor chip itself. This lets the processor check the directory after an L2 miss without incurring off-chip delays. The L3 cache in POWER5 is on the processor side of the fabric, not on the memory side as in POWER4, as the previous chart shows. This design lets the POWER5 satisfy L2 cache misses more frequently with hits in the off-chip 36 MB MLD L3, thus avoiding traffic on the interchip fabric. References to data not in the on-chip L2 cause the system to check the L3 cache before sending requests onto the interchip fabric.
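The victim-cache relationship between L2 and L3 can be sketched as a toy model. The class below is an illustration only: it treats both caches as small, fully associative LRU structures with made-up capacities, and it ignores coherence, slices, and the on-chip directory.

```python
from collections import OrderedDict

class VictimHierarchy:
    """Toy L2 with an exclusive victim L3 (capacities are hypothetical)."""

    def __init__(self, l2_lines, l3_lines):
        self.l2 = OrderedDict()   # ordered from least to most recently used
        self.l3 = OrderedDict()
        self.l2_lines = l2_lines
        self.l3_lines = l3_lines

    def access(self, addr):
        """Return where the line was found: 'L2', 'L3', or 'fabric'."""
        if addr in self.l2:
            self.l2.move_to_end(addr)
            return "L2"
        if addr in self.l3:
            del self.l3[addr]     # the line moves to L2, so it leaves L3
            source = "L3"
        else:
            source = "fabric"     # only now does a request go onto the fabric
        self._fill_l2(addr)
        return source

    def _fill_l2(self, addr):
        if len(self.l2) == self.l2_lines:
            victim, _ = self.l2.popitem(last=False)  # evict the LRU line from L2
            if len(self.l3) == self.l3_lines:
                self.l3.popitem(last=False)          # make room in L3
            self.l3[victim] = True                   # cast the victim out to L3
        self.l2[addr] = True

h = VictimHierarchy(l2_lines=2, l3_lines=4)
print([h.access(a) for a in [1, 2, 3, 1]])  # ['fabric', 'fabric', 'fabric', 'L3']
```

The last access is satisfied from L3 because line 1 was cast out of L2 when line 3 arrived. At no point does a line appear in both structures.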

The memory controller is also on the POWER5 chip and helps to reduce memory latencies by eliminating driver and receiver delays to an external controller.

System structure of POWER4- and POWER5-based systems

The figure shows the high-level structures of POWER4- and POWER5-based systems. The POWER4 handles up to a 32-way symmetric multiprocessor. Going beyond 32 processors increases interprocessor communication, resulting in high traffic on the interconnection fabric. This can cause greater contention and negatively affect system scalability. Moving the level-three (L3) cache from the memory side to the processor side of the fabric allows POWER5 to satisfy level-two (L2) cache misses more frequently with hits in the 36 MB off-chip L3 cache, avoiding traffic on the interchip fabric. References to data not resident in the on-chip L2 cache cause the system to check the L3 cache before sending requests on to the interconnection fabric. Moving the L3 cache provides significantly more cache on the processor side than previously available, thus reducing traffic on the fabric and allowing POWER5-based systems to scale to higher levels of symmetric multiprocessing. Initial POWER5 systems support 64 physical processors.

The POWER4 includes a 1.41 MB on-chip L2 cache. POWER4+ chips are similar in design to the POWER4, but are fabricated in 130 nm technology rather than the POWER4's 180 nm technology. The POWER4+ includes a 1.5 MB on-chip L2 cache, whereas the POWER5 supports a 1.875 MB on-chip L2 cache. POWER4 and POWER4+ systems both have 32 MB L3 caches, whereas POWER5 systems have a 36 MB L3 cache.

The L3 cache operates as a backdoor with separate buses for reads and writes that operate at half processor speed. In POWER4 and POWER4+ systems, the L3 was an inline cache for data retrieved from memory. Because of the higher transistor density of the POWER5's 130 nm technology, the memory controller was moved on chip, eliminating a chip previously needed for the memory controller function. These two changes in the POWER5 also have the significant side benefits of reducing latency to the L3 cache and main memory, as well as reducing the number of chips necessary to build a system.

Simultaneous Multi-Threading (SMT)

Simultaneous Multi-Threading is a new technology which is part of the POWER5 architecture. You need to know how it works and what benefits it can provide. It is not a cure-all! In this topic we will discuss the evolution of SMT, its function and some guidelines for appropriate use in solution design.

POWER4 instruction pipeline

The POWER4 microprocessor is a high-frequency, speculative superscalar machine with out-of-order instruction execution capabilities. Eight independent execution units are capable of executing instructions in parallel, providing a significant performance attribute known as superscalar execution. These include two identical floating-point execution units, each capable of completing a multiply/add instruction each cycle (for a total of four floating-point operations per cycle), two load-store execution units, two fixed-point execution units, a branch execution unit, and a condition register (CR) unit used to perform logical operations on the condition register.

To keep these execution units supplied with work, each processor can fetch up to eight instructions per cycle and can dispatch and complete instructions at a rate of up to five per cycle. A processor is capable of tracking over 200 instructions in-flight at any point in time. Instructions may issue and execute out-of-order with respect to the initial instruction stream, but are carefully tracked so as to complete in program order. In addition, instructions may execute speculatively to improve performance when accurate predictions can be made about conditional scenarios.

The figure in this chart depicts the POWER4 processor execution pipeline and shows the deeply pipelined structure of the machine's design. Each small box represents a stage of the pipeline (a stage is the logic that is performed in a single processor cycle). Note that there is a common pipeline which first handles instruction fetching and group formation, and this then divides into four different pipelines corresponding to four of the five types of execution units in the machine (the CR execution unit, which is similar to the fixed-point execution unit, is not shown). All pipelines have a common termination stage, which is the group completion (CP) stage.

Instruction fetch, group formation, and dispatch: The instructions that make up a program are read in from storage and are executed by the processor. During each cycle, up to eight instructions may be fetched from cache according to the address in the instruction fetch address register (IFAR) and the fetched instructions are scanned for branches (corresponding to the IF, IC, and BP stages in the figure).

Since instructions may be executed out of order, it is necessary to keep track of the program order of all instructions in-flight. In the POWER4 microprocessor, instructions are tracked in groups of one to five instructions rather than as individual instructions. Groups are formed in the pipeline stages D0, D1, D2, and D3. This requires breaking some of the more complex PowerPC instructions down into two or more simpler instructions.
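Tracking instructions by group can be pictured with a short sketch. It is a deliberate simplification: it assumes only that a group holds one to five instructions and that a branch closes the current group. The real hardware has further slot restrictions and cracks complex instructions, which this sketch ignores. The function name and instruction labels are hypothetical.

```python
def form_groups(instructions, max_size=5):
    """Split an instruction stream into groups of one to max_size instructions.

    Assumption for illustration: a branch always closes the current group.
    """
    groups, current = [], []
    for op in instructions:
        current.append(op)
        if len(current) == max_size or op == "branch":
            groups.append(current)
            current = []
    if current:
        groups.append(current)  # final, possibly short, group
    return groups

stream = ["load", "add", "add", "store", "mul", "add", "branch", "load"]
for group in form_groups(stream):
    print(group)
```

Because completion is tracked per group, the machine needs one tracking entry for each list printed here, not one per instruction.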

Multithreading Evolution

Modern processors have multiple specialized execution units, each of which is capable of handling a small subset of the instruction set architecture - some will handle integer operations, some floating point, and so on. These execution units are capable of operating in parallel and so several instructions of a program may be executing simultaneously.

However, conventional processors execute instructions from a single instruction stream. Despite microarchitectural advances, execution unit utilization remains low in today's microprocessors. It is not unusual to see average execution unit utilization rates of approximately 25% across a broad spectrum of environments. To increase execution unit utilization, designers use thread-level parallelism, in which the physical processor core executes instructions from more than one instruction stream. To the operating system, the physical processor core appears as if it is a symmetric multiprocessor containing two logical processors.

There are at least three different methods for handling multiple threads:

  • Coarse-grained multi-threading
  • Fine-grained multi-threading
  • Simultaneous multi-threading (SMT)

These methods are described next.

Coarse-grained Multi-Threading

In coarse-grained multi-threading, only one thread executes at any instant. When a thread encounters a long-latency event, such as a cache miss, the hardware swaps in a second thread to use the machine's resources, rather than letting the machine remain idle. By allowing other work to use what otherwise would be idle cycles, this scheme increases overall system throughput. To conserve resources, both threads share many system resources, such as architectural registers. Hence, swapping program control from one thread to another requires several cycles. IBM implemented coarse-grained multi-threading in the IBM pSeries Model 680.

Coarse-grained multi-threading was introduced in IBM's Star series of processors (for example, the RS64-IV, available in the S85) to improve system performance for many workloads. A multi-threaded processor improves the resource utilization of a processor core by running several hardware threads in parallel. For the Star series, the number of concurrent threads was two.

The basic idea is that when one or more threads of a processor are stalled on a long latency event (for example, waiting on a cache miss), other threads try to keep the core busy. However, AIX needed to be aware of the difference between logical and physical processors and had the responsibility for making sure that each logical processor had a dispatchable thread - even to the point of creating idle threads.

Note that coarse-grained multi-threading was never widely used by customers. This was partly because it was not enabled by default and required a reboot to activate. Another reason was that performance was variable and could, in fact, be affected negatively. For workloads with high thread:processor ratios (for example, TPC-C), coarse-grained multi-threading can deliver ~20% increased performance. In other workloads, for example, Business Intelligence, where the thread:processor ratio is <2:1, AIX must create dummy threads for the processor context switch to take place. Switching to or from these dummy threads costs about six machine cycles, whereas without coarse-grained multi-threading being active, AIX would not have performed a context switch at all. The other disadvantage of coarse-grained multi-threading was that it disabled Dynamic CPU Deallocation.

Fine-grained Multi-Threading

A variant of coarse-grained multi-threading is fine-grained multi-threading. Machines of this class execute threads in successive cycles, in round-robin fashion. Accommodating this design requires duplicate hardware facilities. When a thread encounters a long-latency event, its cycles remain unused. POWER4 processors implemented an SMP on a chip, but are not considered fine-grained multi-threading.

POWER5 instruction pipeline

The POWER5 processor core supports both enhanced SMT and single-threaded (ST) operation modes. This chart shows the POWER5's instruction pipeline, which is identical to the POWER4's. All pipeline latencies in the POWER5, including the branch misprediction penalty and load-to-use latency with an L1 data cache hit, are the same as in the POWER4. The identical pipeline structure lets optimizations designed for POWER4-based systems perform equally well on POWER5-based systems.

In SMT mode, the POWER5 uses two separate instruction fetch address registers to store the program counters for the two threads. Instruction fetches (IF stage) alternate between the two threads. In ST mode, the POWER5 uses only one program counter and can fetch instructions for that thread every cycle. It can fetch up to eight instructions from the instruction cache (IC stage) every cycle. The two threads share the instruction cache and the instruction translation facility. In a given cycle, all fetched instructions come from the same thread.

Some differences are:

  • There are 120 physical general purpose registers (GPRs) and 120 physical floating-point registers (FPRs).
  • In single-threaded operation, the POWER5 makes all physical registers available to the single thread, allowing higher instruction-level parallelism.
  • Two groups can commit per cycle, one from each thread.
  • The L1 instruction and data caches are the same size as in the POWER4 (64 KB and 32 KB), but their associativity has doubled to two- and four-way. The first-level data translation table is now fully associative, but the size remains at 128 entries.
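The instruction fetch behavior described in this section (alternating between the two instruction fetch address registers in SMT mode, fetching for one thread every cycle in ST mode) can be sketched as follows. The function name and thread numbering are hypothetical, and strict alternation is an assumption; the sketch does not model what happens when one thread cannot fetch.

```python
def fetch_schedule(mode, cycles):
    """Return which thread's fetch address register drives fetch in each cycle.

    mode "SMT": fetch alternates between thread 0 and thread 1.
    mode "ST":  the single active thread (numbered 0 here) fetches every cycle.
    """
    if mode == "SMT":
        return [cycle % 2 for cycle in range(cycles)]
    if mode == "ST":
        return [0] * cycles
    raise ValueError(f"unknown mode: {mode!r}")

print(fetch_schedule("SMT", 6))  # [0, 1, 0, 1, 0, 1]
print(fetch_schedule("ST", 6))   # [0, 0, 0, 0, 0, 0]
```

Each entry names a single thread, which reflects the statement that all instructions fetched in a given cycle come from the same thread.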

Simultaneous Multi-Threading

In simultaneous multi-threading (SMT), as in other multithreaded implementations, the processor fetches instructions from more than one thread. What differentiates this implementation is its ability to schedule instructions for execution from all threads concurrently. With SMT, the system dynamically adjusts to the environment, allowing instructions to execute from each thread if possible, and allowing instructions from one thread to utilize all the execution units if the other thread encounters a long latency event. The POWER5 design implements two-way SMT on each of the chip's two processor cores. Although a higher level of multi-threading is possible, our simulations showed that the added complexity was unjustified. As designers add simultaneous threads to a single physical processor, the marginal performance benefit decreases. In fact, additional multi-threading might decrease performance because of cache thrashing, as data from one thread displaces data needed by another thread.
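The difference between fine-grained multi-threading and SMT can be made concrete with a toy issue-slot model. Everything here is an assumption for illustration: the per-cycle "ready" counts are made up, a zero stands for a cycle in which the thread is stalled on a long-latency event, and instructions that are not issued are simply dropped and not carried over to later cycles.

```python
def issued_single(ready, width):
    """Instructions issued when one thread has the core to itself."""
    return sum(min(r, width) for r in ready)

def issued_fine_grained(ready_a, ready_b, width):
    """Threads own alternate cycles; a stalled thread wastes its cycle."""
    total = 0
    for cycle, (a, b) in enumerate(zip(ready_a, ready_b)):
        owner_ready = a if cycle % 2 == 0 else b
        total += min(owner_ready, width)
    return total

def issued_smt(ready_a, ready_b, width):
    """Both threads may issue in the same cycle, sharing the issue slots."""
    return sum(min(a + b, width) for a, b in zip(ready_a, ready_b))

WIDTH = 4                       # hypothetical issue width
ready_a = [2, 0, 0, 3, 1, 2]    # hypothetical ready counts, thread A
ready_b = [1, 2, 3, 0, 0, 2]    # hypothetical ready counts, thread B

print(issued_single(ready_a, WIDTH))                 # 8
print(issued_fine_grained(ready_a, ready_b, WIDTH))  # 7
print(issued_smt(ready_a, ready_b, WIDTH))           # 16
```

With these made-up numbers, the SMT model fills 16 of the 24 available slots because a stalled thread's slots go to the other thread in the same cycle, while the fine-grained model leaves a stalled thread's cycle empty.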

  • Each chip appears as a 4-way SMP to software
    • Allows instructions from two threads to execute simultaneously
  • Processor resources optimized for enhanced SMT performance
    • No context switching, no dummy threads
  • Hardware, POWER Hypervisor, or OS controlled thread priority
    • Dynamic feedback of shared resources allows for balanced thread execution
  • Dynamic switching between single and multithreaded mode

Which Workloads are Likely to Benefit From Simultaneous Multi-threading?

This is a very difficult question to answer, because the performance benefit of simultaneous multi-threading is workload dependent. Most measurements of commercial workloads have shown a 25-40% boost, and a few have shown even more. These measurements were taken in a dedicated partition. Simultaneous multi-threading is also expected to help shared processor partitions. The extra threads give the partition a boost after it is dispatched, because they enable the partition to recover its working set more quickly. After that, the threads perform as they would in a dedicated partition. It may be somewhat non-intuitive, but simultaneous multi-threading is at its best when the performance of the cache is at its worst.

The question may also be answered with the following generalities. Any workload where the majority of individual software threads highly utilize any resource in the processor or memory will benefit little from simultaneous multi-threading. For example, workloads that are heavily floating-point intensive are likely to gain little from simultaneous multi-threading and are the ones most likely to lose performance, because they tend to heavily utilize either the floating-point units or the memory bandwidth. In contrast, workloads that have a very high Cycles Per Instruction (CPI) count tend to utilize processor and memory resources poorly and usually see the greatest simultaneous multi-threading benefit. These large CPIs are usually caused by high cache miss rates from a very large working set. Large commercial workloads typically have this characteristic, although it is somewhat dependent upon whether the two hardware threads share instructions or data or are completely distinct. Workloads that share instructions or data, which would include those that run a lot in the operating system or within a single application, tend to have better SMT benefits. Workloads with low CPI and low cache miss rates tend to see a benefit, but a smaller one.

The objective of dynamic resource balancing is to ensure that the two threads executing on the same processor flow smoothly through the system. Dynamic resource-balancing logic monitors resources such as the global completion table (GCT) and the load miss queue to determine if one thread is hogging resources. For example, if one thread encounters multiple L2 cache load misses, dependent instructions can back up in the issue queues, preventing additional groups from dispatching and slowing down the other thread. To prevent this, resource-balancing logic detects that a thread has reached a threshold of L2 cache misses and throttles that thread. The other thread can then flow through the machine without encountering congestion from the stalled thread. The POWER5 resource-balancing logic also monitors how many GCT entries each thread is using. If one thread starts to use too many GCT entries, the resource-balancing logic throttles it back to prevent it from blocking the other thread. Depending on the situation, the POWER5 resource-balancing logic has three thread-throttling mechanisms:

  • Reducing the thread's priority
  • Inhibiting the thread's instruction decoding until the congestion clears
  • Flushing all the thread's instructions that are waiting for dispatch and holding the thread's decoding until the congestion clears
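The selection among the three throttling mechanisms can be sketched as a simple decision function. This page does not say which condition triggers which mechanism, so the mapping below is an assumption, as are the threshold values, the order of the checks, and all names.

```python
from enum import Enum

class Throttle(Enum):
    NONE = "no throttling"
    REDUCE_PRIORITY = "reduce the thread's priority"
    INHIBIT_DECODE = "inhibit decode until the congestion clears"
    FLUSH_AND_HOLD = "flush instructions awaiting dispatch and hold decode"

# Hypothetical thresholds; the text does not give the real values.
GCT_ENTRY_THRESHOLD = 12
L2_MISS_THRESHOLD = 4

def choose_throttle(gct_entries, l2_misses, long_latency_op=False):
    """Pick a throttling mechanism for one thread (assumed mapping)."""
    if long_latency_op:
        # Assumed: a very long-running instruction triggers a flush.
        return Throttle.FLUSH_AND_HOLD
    if l2_misses >= L2_MISS_THRESHOLD:
        # Assumed: too many outstanding L2 misses stops decode for the thread.
        return Throttle.INHIBIT_DECODE
    if gct_entries >= GCT_ENTRY_THRESHOLD:
        # Assumed: heavy GCT use lowers the thread's priority.
        return Throttle.REDUCE_PRIORITY
    return Throttle.NONE

print(choose_throttle(gct_entries=15, l2_misses=0).value)
print(choose_throttle(gct_entries=3, l2_misses=5).value)
```

The point of the sketch is the structure: the logic watches per-thread resource counters and applies a progressively heavier mechanism to the thread that is causing congestion.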

  • Instances when unbalanced execution is desirable
    • No work for opposite thread
    • Thread waiting on lock
    • Software-determined non-uniform balance
    • Power management
  • Control instruction decode rate
    • Software/hardware controls eight priority levels for each thread

Adjustable thread priority lets software determine when one thread should have a greater (or lesser) share of execution resources. All software layers (operating systems, middleware, and applications) can set the thread priority, although some priority levels are reserved for setting by a privileged instruction only. Reasons for choosing an imbalanced thread priority include the following:

  • A thread is in a spin loop waiting for a lock.
  • A thread has no immediate work to do and is waiting in an idle loop.
  • One application must run faster than another.

The POWER5 microprocessor supports eight software-controlled priority levels for each thread. Level 0 is in effect when a thread is not running. Levels 1 (the lowest) through 7 apply to running threads. The POWER5 chip observes the difference in priority levels between the two threads and gives the one with higher priority additional decode cycles. The figure shows how the difference in thread priority affects the relative performance of each thread. If both threads are at the lowest running priority (level 1), the microprocessor assumes that neither thread is doing meaningful work and throttles the decode rate to conserve power.
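How a priority difference could translate into decode cycles can be sketched with a simple model. The text only says that the higher-priority thread receives additional decode cycles. The specific rule used below, a window of R = 2^(|X - Y| + 1) cycles in which the lower-priority thread gets one decode cycle and the higher-priority thread gets the rest, is an assumption made for illustration, and the function name is hypothetical.

```python
from fractions import Fraction

def decode_shares(prio_a, prio_b):
    """Return (share_a, share_b) of decode cycles for two running threads.

    Assumed model: in a window of R = 2 ** (|a - b| + 1) cycles, the
    lower-priority thread gets 1 cycle and the other thread gets R - 1.
    """
    for p in (prio_a, prio_b):
        if not 1 <= p <= 7:
            raise ValueError("running threads use priority levels 1 through 7")
    r = 2 ** (abs(prio_a - prio_b) + 1)
    high, low = Fraction(r - 1, r), Fraction(1, r)
    if prio_a == prio_b:
        return Fraction(1, 2), Fraction(1, 2)  # equal priority, equal shares
    return (high, low) if prio_a > prio_b else (low, high)

print(decode_shares(4, 4))  # (Fraction(1, 2), Fraction(1, 2))
print(decode_shares(6, 4))  # (Fraction(7, 8), Fraction(1, 8))
```

The sketch does not model the power-saving case described above, in which both threads run at level 1 and the overall decode rate is throttled.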

  • Advantageous for execution unit limited applications
    • Floating or fixed point intensive workloads
  • Execution unit limited applications provide minimal performance leverage for SMT
    • Extra resources necessary for SMT provide higher performance benefit when dedicated to single thread
  • Determined dynamically on a per processor basis

Not all applications benefit from SMT. Having two threads executing on the same processor will not increase the performance of applications with execution-unit-limited performance or applications that consume all the chip's memory bandwidth. For this reason, the POWER5 supports the ST execution mode. In this mode, the POWER5 gives all the physical resources, including the GPR and FPR rename pools, to the active thread, allowing it to achieve higher performance than a POWER4 system at equivalent frequencies.

The POWER5 supports two types of single-threaded operation: an inactive thread can be in either a dormant or a null state. From a hardware perspective, the only difference between these states is whether or not the thread awakens on an external or decrementer interrupt.

In the dormant state, the operating system boots up in SMT mode, but instructs the hardware to put the thread into the dormant state when there is no work for that thread. To make a dormant thread active, either the active thread executes a special instruction or an external or decrementer interrupt targets the dormant thread. The hardware detects these scenarios and changes the dormant thread to the active state. It is the software's responsibility to restore the architected state of a thread transitioning from the dormant to the active state.

When a thread is in the null state, the operating system is unaware of the thread's existence. As in the dormant state, the operating system does not allocate resources to a null thread. This mode is advantageous if all the system's executing tasks perform better in ST mode.
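The active, dormant, and null thread states described above can be summarized as a small state machine. The state names follow the text; the event names are hypothetical labels chosen for this sketch.

```python
from enum import Enum

class ThreadState(Enum):
    ACTIVE = "active"
    DORMANT = "dormant"
    NULL = "null"

# Events that wake a dormant thread, per the text (names are hypothetical).
WAKE_EVENTS = {"special_instruction", "external_interrupt", "decrementer_interrupt"}

def next_state(state, event):
    """Return the thread state after an event."""
    if state is ThreadState.ACTIVE and event == "os_requests_dormant":
        return ThreadState.DORMANT   # OS parks the thread when it has no work
    if state is ThreadState.DORMANT and event in WAKE_EVENTS:
        return ThreadState.ACTIVE    # hardware wakes the thread
    return state                     # a null thread does not wake on interrupts

print(next_state(ThreadState.DORMANT, "external_interrupt"))  # ThreadState.ACTIVE
print(next_state(ThreadState.NULL, "external_interrupt"))     # ThreadState.NULL
```

The sketch covers only the hardware-visible transitions; restoring the architected state of a thread that leaves the dormant state remains the software's job, as noted above.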

Simultaneous Multi-Threading (SMT)

To improve performance at the application level, simultaneous multi-threading is built into the POWER5 chip. Applications developed to use process-level parallelism (multi-tasking) and thread-level parallelism (multi-threading) can shorten their overall execution time. For throughput-oriented applications, simultaneous multi-threading is the next step in keeping the processor busy: it extends instruction-level parallelism across threads so that the processor's multiple pipelines stay supplied with work.

The simultaneous multi-threading mode maximizes the usage of the execution units. The POWER5 chip introduces more rename registers (for floating-point operations, the number of rename registers increased to 120). Rename registers are essential for out-of-order execution and are therefore vital for simultaneous multi-threading.

If simultaneous multi-threading is activated:

  • More instructions can be executed at the same time.
  • The operating system sees twice as many processors as there are physical processors installed in the system.
  • Simultaneous multi-threading is supported in mixed environments:
    • Capped and uncapped partitions
    • Virtual partitions
    • Dedicated partitions
    • Single partition systems

Note: Simultaneous multi-threading is supported on POWER5 processor-based systems running the Linux operating system at an appropriate level.

The simultaneous multi-threading policy is controlled by the operating system and is thus partition specific.

For Linux, an additional boot option must be set to activate simultaneous multi-threading after a reboot.

Simultaneous Multi-Threading Features

To improve simultaneous multi-threading performance for various workloads and provide robust quality of service, the POWER5 processor provides two features:

  • Dynamic resource balancing
    Dynamic resource balancing is designed to ensure that the two threads executing on the same processor flow smoothly through the system. Depending on the situation, the POWER5 processor resource balancing logic applies different thread throttling mechanisms. For example, a thread that reaches a threshold of L2 cache misses is throttled so that the other thread can flow past the stalled thread.
  • Adjustable thread priority
    Adjustable thread priority allows software to determine when one thread should have a greater (or lesser) share of execution resources. The POWER5 processor supports eight software-controlled priority levels for each thread.

Single threading operation

Having two threads executing on the same processor will not increase the performance of applications with execution-unit-limited performance, or applications that consume all the chip's memory bandwidth. For this reason, the POWER5 processor supports the single-threading execution mode. In this mode, the POWER5 processor gives all the physical resources to the active thread, allowing it to achieve higher performance than a POWER4 processor-based system at equivalent frequencies. Highly optimized scientific codes are one example where single-threading operation may provide more throughput.

Dynamic power management

In current Complementary Metal Oxide Semiconductor (CMOS) technologies, chip power is one of the most important design parameters. With the introduction of simultaneous multi-threading, more instructions execute per cycle per processor core, thus increasing the core's and the chip's total switching power. To reduce switching power, POWER5 chips use a fine-grained, dynamic clock gating mechanism extensively. This mechanism gates off clocks to a local clock buffer if dynamic power management logic knows the set of latches driven by the buffer will not be used in the next cycle. This allows substantial power saving with no performance impact. In every cycle, the dynamic power management logic determines whether a local clock buffer that drives a set of latches can be clock gated in the next cycle.

In addition to the switching power, leakage power has become a performance limiter. To reduce leakage power, the POWER5 chip uses transistors with low threshold voltage only in critical paths. The POWER5 chip also has a low-power mode, enabled when the system software instructs the hardware to execute both threads at the lowest available priority. In low power mode, instructions are dispatched once every 32 cycles at most, further reducing switching power. The POWER5 chip uses this mode only when there is no ready task to run on either thread.
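The effect of the low-power mode on dispatch can be sketched with a little arithmetic. The 32-cycle interval comes from the text. Treating priority level 1 as the "lowest available priority" and assuming one dispatch opportunity per cycle in normal operation are simplifications made for this sketch, and the function names are hypothetical.

```python
LOW_POWER_DISPATCH_INTERVAL = 32  # cycles between dispatches, from the text
LOWEST_RUNNING_PRIORITY = 1       # assumed to be the "lowest available priority"

def in_low_power_mode(prio_a, prio_b):
    """Low-power mode applies when both threads run at the lowest priority."""
    return prio_a == LOWEST_RUNNING_PRIORITY and prio_b == LOWEST_RUNNING_PRIORITY

def max_dispatch_opportunities(cycles, prio_a, prio_b):
    """Upper bound on dispatch opportunities in a window of cycles.

    Assumes one dispatch opportunity per cycle outside low-power mode.
    """
    if in_low_power_mode(prio_a, prio_b):
        return cycles // LOW_POWER_DISPATCH_INTERVAL
    return cycles

print(max_dispatch_opportunities(3200, 1, 1))  # 100
print(max_dispatch_opportunities(3200, 4, 1))  # 3200
```

Under these assumptions, low-power mode cuts dispatch activity by a factor of 32, which is where the additional reduction in switching power comes from.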
