ALPHA ARCHITECTURE TECHNICAL SUMMARY

Dick Sites, Rich Witek

[NOTE: "Alpha" is an internal code name. An official name will be
announced soon.]

WHAT IS ALPHA?

Alpha is a 64-bit RISC architecture, designed with particular emphasis
on speed, multiple instruction issue, multiple processors, software
migration from VAX VMS and MIPS ULTRIX, and long lifetime. The
architects rejected any feature that did not appear to be usable for at
least 25 years.

The first chip implementation runs at up to 200 MHz. The speed of Alpha
implementations is expected to scale up from this by at least a factor
of 1000 over the next 25 years.

FORMATS

Data Formats

Alpha is a load/store RISC architecture with all operations done
between registers. Alpha has 32 integer registers and 32 floating
registers, each 64 bits. Integer register R31 and floating register F31
are always zero. Longword (32-bit) and quadword (64-bit) integers are
supported. Four floating datatypes are supported: VAX F-float, VAX
G-float, IEEE single (32-bit), and IEEE double (64-bit). Memory is
accessed via 64-bit virtual little-endian byte addresses.

Instruction Formats

Alpha instructions are all 32 bits, in four different instruction
formats specifying 0, 1, 2, or 3 register fields. All formats have a
6-bit opcode.

    +-----+-------------------------+
    | OP  |         number          |  PALcall
    +-----+----+--------------------+
    | OP  | RA |        disp        |  Branch
    +-----+----+----+---------------+
    | OP  | RA | RB |     disp      |  Memory
    +-----+----+----+----------+----+
    | OP  | RA | RB |  func.   | RC |  Operate
    +-----+----+----+----------+----+

PALcalls specify one of a few dozen complex operations to be performed.

Conditional branches test register RA and specify a signed 21-bit
PC-relative longword target displacement. Subroutine calls put the
return address in RA.

Loads and stores move longwords or quadwords between RA and memory,
using RB plus a signed 16-bit displacement as the memory address.

Operates use source registers RA and RB, writing result register RC.
There is an extended opcode in the 11-bit function field. Integer
operates can use the RB field and part of the function field to specify
an 8-bit zero-extended literal.

INSTRUCTIONS

PALcall Instructions

The Privileged Architecture Library call instructions specify one of a
few dozen complex functions to be performed. These functions deal with
interrupts and exceptions, task switching, virtual memory, and other
complex operations that must be done atomically. PALcall instructions
vector to a privileged library of software subroutines (using the same
Alpha instruction set) that implement an operating-system-specific set
of these complex operations.

Branch Instructions

Conditional branch instructions can test a register for
positive/negative or for zero/nonzero. They can also test integer
registers for even/odd. Unconditional branch instructions can write a
return address into a register. There is also a calculated jump
instruction that branches to an arbitrary 64-bit address in a register.

Load/Store Instructions

Load and store instructions can move either 32- or 64-bit aligned
quantities. The VAX floating-point load/store instructions swap words
to give a consistent register format for floats. Memory addresses are
flat 64-bit virtual addresses, with no segmentation.

A 32-bit integer datum is placed in a register in a canonical form that
makes 33 copies of the high bit of the datum. A 32-bit floating datum
is placed in a register in a canonical form that extends the exponent
by 3 bits and extends the fraction with 29 low-order zeros. 32-bit
operates preserve these canonical forms.

There are no 8- or 16-bit load/store instructions, but there are
facilities for doing byte manipulation in registers.

Alpha has no 32/64 mode bit or other such device. Compilers, as
directed by user declarations, can generate any mixture of 32- and
64-bit operations.
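
The canonical 32-bit integer form described above amounts to sign
extension: bit 31 of the longword is replicated through bits 63..32, so
the register holds 33 copies of that bit. The Python sketch below is
only an illustration of that rule, not the architecture's definition.
The names canonical_longword and add_longword are hypothetical, and
Python integers masked to 64 bits stand in for registers.

```python
MASK32 = 0xFFFF_FFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def canonical_longword(value):
    """Return the 64-bit register image of a 32-bit datum (hypothetical helper)."""
    low = value & MASK32
    if low & 0x8000_0000:
        # High bit set: fill bits 63..32 with ones (33 copies of bit 31 in all).
        return low | (MASK64 ^ MASK32)
    # High bit clear: bits 63..32 stay zero.
    return low


def add_longword(a, b):
    """Sketch of a 32-bit add without overflow checking.

    Adds the two register images, then puts the low 32 bits of the sum
    back into canonical form.
    """
    return canonical_longword(a + b)
```

For example, adding 1 to the canonical form of 0x7FFFFFFF wraps to
0x80000000, whose register image is 0xFFFFFFFF80000000. The result is
again in canonical form, so a later 32-bit operate can consume it
directly.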

Integer Operate Instructions

The integer operate instructions manipulate full 64-bit values, and
include the usual assortment of arithmetic, compare, logical, and shift
instructions.

There are just three 32-bit integer operates: add, subtract, and
multiply. These differ from their 64-bit counterparts ONLY in overflow
detection and in producing 32-bit canonical results.

There is no integer divide instruction.

In addition to the operations found in conventional RISC architectures,
there are scaled add/subtract for quick subscript calculation, 128-bit
multiply for division by a constant and multiprecision arithmetic,
conditional moves for avoiding branches, and an extensive set of
in-register byte manipulation instructions for avoiding single-byte
writes.

Rather than keeping a global state bit for integer overflow trap
enable, the enable is encoded in the function field of each
instruction. Thus, both ADDQ/V and ADDQ opcodes exist for specifying
64-bit add with and without overflow checking. This makes pipelined
implementations easier.

Floating-point Operate Instructions

The floating operate instructions include four complete sets of VAX and
IEEE arithmetic, plus conversions between float and integer.

There is no floating square root instruction.

In addition to the operations found in conventional RISC architectures,
there are conditional moves for avoiding branches, and merge
sign/exponent instructions for simple field manipulation.

Rather than keeping global state bits for arithmetic trap enables and
rounding mode, these enable and mode bits are encoded in the function
field of each instruction.

SIGNIFICANT DIFFERENCES BETWEEN ALPHA AND CONVENTIONAL RISC PROCESSORS

First, Alpha is a true 64-bit architecture, with a minimal number of
32-bit instructions. It is not a 32-bit architecture that was later
expanded to 64 bits.

Second, Alpha was designed to allow very high-speed implementations.
The instructions are very simple (no
load-four-registers-unaligned-and-check-for-bytes-of-zero). There are
no special registers that would prevent pipelining multiple instances
of the same operations (no MQ register and no condition codes). The
instructions interact with each other ONLY by one instruction writing a
register or memory, and another one reading from the same place. This
makes it particularly easy to build implementations that issue multiple
instructions every CPU cycle. (The first implementation in fact issues
two instructions every cycle.)

There are no implementation-specific pipeline timing hazards, no
load-delay slots, and no branch-delay slots. These features would make
it difficult to maintain binary compatibility across multiple
implementations and difficult to maintain full speed on multiple-issue
implementations.

Alpha is unconventional in the approach to byte manipulation.
Single-byte stores found in conventional RISC architectures force cache
and memory implementations to include byte shift-and-mask logic, and
sequencer logic to perform read-modify-write on memory words. This
approach is awkward to implement quickly, and tends to slow down cache
access to normal 32- or 64-bit aligned quantities. It also makes it
awkward to build a high-speed error-correcting write-back cache, which
is often needed to keep a very fast RISC implementation busy. It also
can make it difficult to pipeline multiple byte operations. Instead,
the byte shifting and masking is done in Alpha with normal 64-bit
register-to-register instructions, crafted to keep the sequences short.

Alpha is also unconventional in the approach to arithmetic traps. In
contrast to conventional RISC architectures, Alpha arithmetic traps
(overflow, underflow, etc.) are imprecise -- they can be delivered an
arbitrary number of instructions after the instruction that triggered
the trap, and traps from many different instructions can be reported at
once.
This makes implementations that use pipelining and multiple issue
substantially easier to build. If precise arithmetic exceptions are
desired, trap barrier instructions can be explicitly inserted in the
program to force traps to be delivered at specific points.

Alpha is also unconventional in the approach to multiprocessor shared
memory. As viewed from a second processor (including an I/O device), a
sequence of reads and writes issued by one processor may be arbitrarily
reordered by an implementation. This allows implementations to use
multi-bank caches, bypassed write buffers, write merging, pipelined
writes with retry on error, etc. If strict ordering between two
accesses must be maintained, memory barrier instructions can be
explicitly inserted in the program.

The basic multiprocessor interlocking primitive is a RISC-style
load_locked, modify, store_conditional sequence. If the sequence runs
without interrupt, exception, or an interfering write from another
processor, then the conditional store succeeds. Otherwise, the store
fails and the program eventually must branch back and retry the
sequence. This style of interlocking scales well with very fast caches,
and makes Alpha an especially attractive architecture for building
multiple-processor systems.

Alpha includes a number of HINTS for implementations, all aimed at
allowing higher speed. Calculated jumps have a target hint that can
allow much faster subroutine calls and returns. There are prefetching
hints for the memory system that can allow much higher cache hit rates.
There are also granularity hints for the virtual-address mapping that
can allow much more effective use of translation lookaside buffers for
big contiguous structures.

Alpha includes a very flexible privileged library of software for
operating-system-specific operations, invoked with PALcalls.
This library allows Alpha to run full VMS using one version of this
software library that mirrors many of the VAX operating-system
features, and to run OSF/1 using a different version that mirrors many
of the MIPS operating-system features, and similarly for NT. Other
versions could be tailored for real-time, teaching, etc. The PALcalls
allow Alpha to run VMS with hardly more hardware than a conventional
RISC machine has (the PAL mode bit itself, plus 4 extra protection bits
in each TB entry). This library makes Alpha an especially attractive
architecture for multiple operating systems.

Finally, Alpha is not strongly biased toward only one or two
programming languages. It is an attractive architecture for compiling
at least a dozen different languages.

SUMMARY

Alpha is designed to be a leadership 64-bit architecture.

--------------------

Specifications (150 MHz version)

Process Technology      .75 micron CMOS
Cycle Time              150 MHz (6.6 ns)
Die Size                13.9 mm x 16.8 mm
Transistor Count        1.68 million
Package                 431-pin PGA
Number of Signal Pins   291
Power Dissipation       23 W at 6.6 ns cycle
Power Supply            3.3 volts
Clocking Input          300 MHz differential
On-chip D-cache         8 Kbyte, physical, direct-mapped,
                        write-through, 32-byte line, 32-byte fill
On-chip I-cache         8 Kbyte, physical, direct-mapped,
                        32-byte line, 32-byte fill, 64 ASNs
On-chip DTB             32-entry; fully-associative; 8-Kbyte,
                        64-Kbyte, 256-Kbyte, 4-Mbyte page sizes
On-chip ITB             8-entry, fully associative, 8-Kbyte page
                        plus 4-entry, fully-associative, 4-Mbyte page
Floating Point Unit     On-chip FPU supports both IEEE and VAX
                        floating point
Bus                     Separate data and address bus.
                        128-bit/64-bit data bus
Serial ROM Interface    Allows the chip to directly access serial ROM
Virtual Address Size    64 bits checked; 43 bits implemented
Physical Address Size   34 bits implemented
Page Size               8 Kbytes
Issue Rate              2 instructions per cycle to A-box, E-box,
                        or F-box
Integer Pipeline        7-stage pipeline
Floating Pipeline       10-stage pipeline
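
As a rough illustration of the address-size entries in the table, the
Python sketch below splits a virtual address into a page number and a
byte offset for the 8-Kbyte page size (13 offset bits), and tests
whether an address fits within the 43 implemented bits. Reading "64
bits checked" as "bits 63..43 must all equal bit 42" is an assumption
made for this sketch, and the names split_va and is_implemented_va are
hypothetical.

```python
PAGE_BITS = 13            # 8-Kbyte page = 2**13 bytes (from the table)
VA_BITS = 43              # implemented virtual-address bits (from the table)
MASK64 = (1 << 64) - 1


def split_va(va):
    """Split the implemented 43 bits into (virtual page number, byte offset)."""
    low = va & ((1 << VA_BITS) - 1)
    return low >> PAGE_BITS, low & ((1 << PAGE_BITS) - 1)


def is_implemented_va(va):
    """Assumed check: bits 63..43 must all be copies of bit 42."""
    va &= MASK64
    upper = va >> (VA_BITS - 1)                   # bits 63..42 (22 bits)
    all_ones = (1 << (64 - VA_BITS + 1)) - 1      # 22 one bits
    return upper == 0 or upper == all_ones
```

With these sizes, a 43-bit address carries a 30-bit page number and a
13-bit offset; split_va(0x2345), for instance, yields page 1 and offset
0x345.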