ALPHA ARCHITECTURE TECHNICAL SUMMARY 
Dick Sites, Rich Witek


[NOTE: "Alpha" is an internal code name. An official name will be announced
 soon.]


WHAT IS ALPHA?

Alpha is a 64-bit RISC architecture, designed with particular emphasis on 
speed, multiple instruction issue, multiple processors, software migration 
from VAX VMS and MIPS ULTRIX, and long lifetime. The architects rejected 
any feature that did not appear to be usable for at least 25 years.

The first chip implementation runs at up to 200 MHz.  The speed of Alpha 
implementations is expected to scale up from this by at least a factor of 
1000 over the next 25 years. 

FORMATS

Data Formats

Alpha is a load/store RISC architecture with all operations done between
registers. Alpha has 32 integer registers and 32 floating registers, each
64 bits. Integer register R31 and floating register F31 are always zero.
Longword (32-bit) and quadword (64-bit) integers are supported. Four
floating datatypes are supported: VAX F-float, VAX G-float, IEEE single
(32-bit), and IEEE double (64-bit). Memory is accessed via 64-bit virtual
little-endian byte addresses. 
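
The register model above can be sketched in a few lines of C. This is only an illustrative model, not part of the architecture definition: the names RegFile, reg_read, and reg_write are made up for the example, and discarding writes to register 31 is an assumption that follows from "always zero".

```c
#include <stdint.h>

#define NUM_REGS 32
#define ZERO_REG 31          /* R31 (or F31) always reads as zero */

typedef struct {
    uint64_t r[NUM_REGS];    /* 32 registers, each 64 bits wide */
} RegFile;

/* Read a register; register 31 is hardwired to zero. */
uint64_t reg_read(const RegFile *rf, unsigned n)
{
    n &= 31;
    return (n == ZERO_REG) ? 0 : rf->r[n];
}

/* Write a register; writes to register 31 are discarded (assumption). */
void reg_write(RegFile *rf, unsigned n, uint64_t value)
{
    n &= 31;
    if (n != ZERO_REG)
        rf->r[n] = value;
}
```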

Instruction Formats

Alpha instructions are all 32 bits, in four different instruction formats
specifying 0, 1, 2, or 3 register fields. All formats have a 6-bit opcode. 

        +-----+-------------------------+
        | OP  |         number          | PALcall
        +-----+----+--------------------+
        | OP  | RA |        disp        | Branch
        +-----+----+----+---------------+
        | OP  | RA | RB |     disp      | Memory
        +-----+----+----+----------+----+
        | OP  | RA | RB |  func.   | RC | Operate
        +-----+----+----+----------+----+

PALcalls specify one of a few dozen complex operations to be performed.

Conditional branches test register RA and specify a signed 21-bit
PC-relative longword target displacement. Subroutine calls put the return
address in RA. 

Loads and stores move longwords or quadwords between RA and memory, using 
RB plus a signed 16-bit displacement as the memory address.

Operates use source registers RA and RB, writing result register RC. There 
is an extended opcode in the 11-bit function field. Integer operates can use 
the RB field and part of the function field to specify an 8-bit 
zero-extended literal.
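
As a rough illustration of the formats, the following C sketch pulls the fields out of a 32-bit instruction word. The exact bit positions are an assumption read off the diagram (opcode in the most significant 6 bits, then 5-bit register fields from left to right), and all function names are hypothetical. The branch target is assumed to be relative to the address of the following instruction.

```c
#include <stdint.h>

/* Extract bits hi..lo (inclusive) of an instruction word. */
static uint32_t bits(uint32_t insn, unsigned hi, unsigned lo)
{
    return (insn >> lo) & ((1u << (hi - lo + 1)) - 1u);
}

/* Sign-extend the low `width` bits of v to 64 bits. */
static int64_t sign_extend(uint32_t v, unsigned width)
{
    uint32_t sign = 1u << (width - 1);
    return (int64_t)(v ^ sign) - (int64_t)sign;
}

/* Assumed field layout, most significant bit first: OP, RA, RB, func, RC. */
unsigned field_opcode(uint32_t insn) { return bits(insn, 31, 26); }
unsigned field_ra(uint32_t insn)     { return bits(insn, 25, 21); }
unsigned field_rb(uint32_t insn)     { return bits(insn, 20, 16); }
unsigned field_func(uint32_t insn)   { return bits(insn, 15, 5);  }
unsigned field_rc(uint32_t insn)     { return bits(insn, 4, 0);   }

/* PALcall format: everything below the opcode is the function number. */
uint32_t pal_number(uint32_t insn)   { return bits(insn, 25, 0); }

/* Memory format: signed 16-bit byte displacement added to RB. */
int64_t mem_disp(uint32_t insn) { return sign_extend(bits(insn, 15, 0), 16); }

uint64_t mem_address(uint64_t rb_value, uint32_t insn)
{
    return rb_value + (uint64_t)mem_disp(insn);
}

/* Branch format: signed 21-bit displacement counted in longwords. */
int64_t branch_disp(uint32_t insn) { return sign_extend(bits(insn, 20, 0), 21); }

uint64_t branch_target(uint64_t pc, uint32_t insn)
{
    /* Assumption: relative to the next instruction (pc + 4). */
    return pc + 4 + (uint64_t)(branch_disp(insn) * 4);
}
```

With a 21-bit longword displacement, a branch can reach about one million instructions in either direction.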

INSTRUCTIONS

PALcall Instructions

The Privileged Architecture Library call instructions specify one of a few
dozen complex functions to be performed. These functions deal with
interrupts and exceptions, task switching, virtual memory, and other
complex operations that must be done atomically. PALcall instructions
vector to a privileged library of software subroutines (using the same Alpha 
instruction set) that implement an operating-system-specific set of these 
complex operations. 

Branch Instructions

Conditional branch instructions can test a register for positive/negative
or for zero/nonzero. They can also test integer registers for even/odd. 
Unconditional branch instructions can write a return address into a 
register. There is also a calculated jump instruction that branches to an 
arbitrary 64-bit address in a register.
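
The register tests named above can be written out as a small C predicate. This is a sketch only: the enum names are invented for the example, only the six conditions mentioned in the text are modeled, and the even/odd test is assumed to look at the low bit of the register.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    BR_NEGATIVE,      /* sign bit set */
    BR_NOT_NEGATIVE,  /* sign bit clear */
    BR_ZERO,
    BR_NONZERO,
    BR_EVEN,          /* low bit clear (integer registers only) */
    BR_ODD            /* low bit set (integer registers only) */
} BranchCond;

/* Decide whether a conditional branch testing register value ra is taken. */
bool branch_taken(BranchCond cond, uint64_t ra)
{
    switch (cond) {
    case BR_NEGATIVE:     return (ra >> 63) != 0;
    case BR_NOT_NEGATIVE: return (ra >> 63) == 0;
    case BR_ZERO:         return ra == 0;
    case BR_NONZERO:      return ra != 0;
    case BR_EVEN:         return (ra & 1) == 0;
    case BR_ODD:          return (ra & 1) != 0;
    }
    return false;
}
```

Because every test needs only one register and no condition codes, a branch depends on nothing but the instruction that last wrote RA.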

Load/Store Instructions

Load and store instructions can move either 32- or 64-bit aligned
quantities. The VAX floating-point load/store instructions swap words to
give a consistent register format for floats. Memory addresses are flat
64-bit virtual addresses, with no segmentation. A 32-bit integer datum is
placed in a register in a canonical form that makes 33 copies of the high
bit of the datum. A 32-bit floating datum is placed in a register in a
canonical form that extends the exponent by 3 bits and extends the fraction
with 29 low-order zeros. 32-bit operates preserve these canonical forms. 
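
The two canonical forms can be illustrated with the C sketch below, which works on raw bit patterns. The function names are hypothetical. The floating case covers IEEE single only and assumes that an all-zero exponent stays all zeros and an all-ones exponent stays all ones, which the text above does not spell out. VAX F-float is not modeled.

```c
#include <stdint.h>

/* 32-bit integer -> 64-bit register: bit 31 is replicated into bits
   63..31, giving 33 copies of the high bit of the datum. */
uint64_t canon_longword(uint32_t datum)
{
    return ((uint64_t)datum ^ 0x80000000ull) - 0x80000000ull;
}

/* IEEE single -> 64-bit register image: the 8-bit exponent becomes
   11 bits and the 23-bit fraction gains 29 low-order zeros. */
uint64_t canon_single(uint32_t datum)
{
    uint64_t sign = (uint64_t)(datum >> 31);
    uint32_t exp8 = (datum >> 23) & 0xFFu;
    uint64_t frac = datum & 0x7FFFFFu;
    uint64_t exp11;

    if (exp8 == 0xFFu)
        exp11 = 0x7FF;                 /* assumption: all ones stays all ones */
    else if (exp8 == 0)
        exp11 = 0;                     /* assumption: zero stays zero */
    else
        exp11 = exp8 + (1023 - 127);   /* re-bias to the 11-bit exponent */

    return (sign << 63) | (exp11 << 52) | (frac << 29);
}
```

For normalized values the result of canon_single is the same bit pattern as the corresponding IEEE double, which is why 32-bit operates can preserve the form cheaply.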

There are no 8- or 16-bit load/store instructions, but there are facilities 
for doing byte manipulation in registers.
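
A byte load can therefore be pictured as a quadword load followed by an in-register extract. The C sketch below models this under stated assumptions: memory is an array of aligned quadwords, byte order is little-endian, and the function names are invented for the example rather than taken from the instruction set.

```c
#include <stdint.h>

/* Load the aligned quadword that contains byte address addr. */
uint64_t load_quad(const uint64_t *mem, uint64_t addr)
{
    return mem[addr >> 3];
}

/* Extract one byte from a quadword; the low three address bits select it. */
uint64_t extract_byte(uint64_t quad, uint64_t addr)
{
    unsigned shift = 8u * (unsigned)(addr & 7);
    return (quad >> shift) & 0xFFu;
}

/* Byte load = quadword load + register-to-register extract. */
uint64_t load_byte(const uint64_t *mem, uint64_t addr)
{
    return extract_byte(load_quad(mem, addr), addr);
}
```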

Alpha has no 32/64 mode bit or other such device. Compilers, as directed by 
user declarations, can generate any mixture of 32- and 64-bit operations.

Integer Operate Instructions

The integer operate instructions manipulate full 64-bit values, and include
the usual assortment of arithmetic, compare, logical, and shift
instructions. There are just three 32-bit integer operates: add, subtract,
and multiply. These differ from their 64-bit counterparts ONLY in overflow
detection and in producing 32-bit canonical results. 
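
The relationship between the 32-bit and 64-bit adds can be sketched in C as follows. Register values are modeled as uint64_t bit patterns, the function names simply mirror the ADDQ mnemonic used later in this summary plus an assumed ADDL counterpart, and overflow detection is left out of this example.

```c
#include <stdint.h>

/* 64-bit add: plain wraparound addition on the full register. */
uint64_t addq(uint64_t a, uint64_t b)
{
    return a + b;
}

/* 32-bit add: same addition, but the result is the low 32 bits
   placed in canonical form (bit 31 copied into bits 63..32). */
uint64_t addl(uint64_t a, uint64_t b)
{
    uint64_t low32 = (uint32_t)(a + b);
    return (low32 ^ 0x80000000ull) - 0x80000000ull;
}
```

For example, adding 1 to 0x7FFFFFFF gives 0x80000000 as a 64-bit add, but the canonical negative longword 0xFFFFFFFF80000000 as a 32-bit add.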

There is no integer divide instruction.

In addition to the operations found in conventional RISC architectures,
there are scaled add/subtract for quick subscript calculation, 128-bit
multiply for division by a constant and multiprecision arithmetic,
conditional moves for avoiding branches, and an extensive set of
in-register byte manipulation instructions for avoiding single-byte writes.
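
Two of these operations are sketched below in portable C: a scaled add used for subscripting an array of quadwords, and the high half of a 128-bit product used to divide by the constant 10. The function names are illustrative, and the multiplier 0xCCCCCCCCCCCCCCCD with a shift of 3 is the well-known reciprocal for unsigned division by 10, not something taken from this summary.

```c
#include <stdint.h>

/* Scaled add: 8*index + base, i.e. the address of element `index`
   in an array of quadwords starting at `base`. */
uint64_t s8add(uint64_t index, uint64_t base)
{
    return (index << 3) + base;
}

/* High 64 bits of the unsigned 128-bit product a*b, built from
   32-bit halves so that no intermediate sum overflows. */
uint64_t umul_high(uint64_t a, uint64_t b)
{
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;

    uint64_t cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFFu) + lo_hi;
    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}

/* Unsigned division by 10 without a divide instruction:
   multiply by ceil(2^67 / 10) and keep bits 67 and up. */
uint64_t div10(uint64_t n)
{
    return umul_high(n, 0xCCCCCCCCCCCCCCCDull) >> 3;
}
```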

Rather than keeping a global state bit for integer overflow trap enable,
the enable is encoded in the function field of each instruction. Thus, both
ADDQ/V and ADDQ opcodes exist for specifying 64-bit add with and without
overflow checking. This makes pipelined implementations easier.
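
A minimal C model of the two forms, assuming 64-bit two's-complement registers held as uint64_t. The AddResult type and the overflow flag are inventions of this sketch; in the architecture the checked form would raise an arithmetic trap rather than return a flag.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t value;     /* wrapped 64-bit sum */
    bool     overflow;  /* true if the signed sum did not fit in 64 bits */
} AddResult;

/* ADDQ: no overflow checking, result simply wraps. */
uint64_t addq(uint64_t a, uint64_t b)
{
    return a + b;
}

/* ADDQ/V: same sum, but signed overflow is detected. Overflow occurred
   if both operands have the same sign and the result's sign differs. */
AddResult addq_v(uint64_t a, uint64_t b)
{
    AddResult r;
    r.value = a + b;
    r.overflow = (((a ^ r.value) & (b ^ r.value)) >> 63) != 0;
    return r;
}
```

Since the choice is made per instruction, checked and unchecked adds can sit side by side in the pipeline with no mode switch between them.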

Floating-point Operate Instructions

The floating operate instructions include four complete sets of VAX and
IEEE arithmetic, plus conversions between float and integer. 

There is no floating square root instruction.

In addition to the operations found in conventional RISC architectures, 
there are conditional moves for avoiding branches, and merge sign/exponent 
instructions for simple field manipulation.
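
The merge operations can be modeled on 64-bit register images as below. This is a sketch under assumptions: the field boundaries are those of IEEE double (sign in bit 63, exponent in bits 62..52), and the function names are illustrative.

```c
#include <stdint.h>

#define SIGN_MASK 0x8000000000000000ull
#define EXP_MASK  0x7FF0000000000000ull

/* Merge sign: sign from fa, exponent and fraction from fb. */
uint64_t merge_sign(uint64_t fa, uint64_t fb)
{
    return (fa & SIGN_MASK) | (fb & ~SIGN_MASK);
}

/* Merge sign and exponent: both from fa, fraction from fb. */
uint64_t merge_sign_exp(uint64_t fa, uint64_t fb)
{
    uint64_t keep = SIGN_MASK | EXP_MASK;
    return (fa & keep) | (fb & ~keep);
}

/* Absolute value: take the sign from a zero register such as F31. */
uint64_t float_abs(uint64_t fb)
{
    return merge_sign(0, fb);
}
```

Operations such as absolute value, negation, and copy-sign thus need no dedicated instructions of their own.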

Rather than keeping global state bits for arithmetic trap enables and
rounding mode, these enable and mode bits are encoded in the function field
of each instruction. 


SIGNIFICANT DIFFERENCES BETWEEN ALPHA AND CONVENTIONAL RISC PROCESSORS

First, Alpha is a true 64-bit architecture, with a minimal number of 32-bit 
instructions. It is not a 32-bit architecture that was later expanded to 64
bits. 

Second, Alpha was designed to allow very high-speed implementations. The
instructions are very simple (no load-four-registers-unaligned-and-check-
for-bytes-of-zero). There are no special registers that would prevent
pipelining multiple instances of the same operations (no MQ register and no
condition codes). The instructions interact with each other ONLY by one
instruction writing a register or memory, and another one reading from the
same place. This makes it particularly easy to build implementations that
issue multiple instructions every CPU cycle. (The first implementation
in fact issues two instructions every cycle.) There are no
implementation-specific pipeline timing hazards, no load-delay slots, and
no branch-delay slots. Such features would make it difficult to maintain
binary compatibility across multiple implementations and difficult to
maintain full speed on multiple-issue implementations. 
 
Alpha is unconventional in the approach to byte manipulation. Single-byte
stores found in conventional RISC architectures force cache and memory
implementations to include byte shift-and-mask logic, and sequencer logic
to perform read-modify-write on memory words. This approach is awkward to
implement quickly, and tends to slow down cache access to normal 32- or
64-bit aligned quantities. It also makes it awkward to build a high-speed
error-correcting write-back cache, which is often needed to keep a very
fast RISC implementation busy. It also can make it difficult to pipeline
multiple byte operations. 

Instead, the byte shifting and masking is done in Alpha with normal 64-bit
register-to-register instructions, crafted to keep the sequences short.
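
A byte store then becomes a short load, mask, insert, store sequence on the containing quadword. The C sketch below models it with memory as an array of aligned quadwords and little-endian byte order. The helper names are hypothetical and stand in for the byte manipulation instructions, whose mnemonics are not given in this summary.

```c
#include <stdint.h>

/* Clear the byte selected by the low three bits of addr. */
uint64_t mask_byte(uint64_t quad, uint64_t addr)
{
    unsigned shift = 8u * (unsigned)(addr & 7);
    return quad & ~(0xFFull << shift);
}

/* Position a byte value in the lane selected by addr. */
uint64_t insert_byte(uint8_t value, uint64_t addr)
{
    unsigned shift = 8u * (unsigned)(addr & 7);
    return (uint64_t)value << shift;
}

/* Byte store as quadword load, two register operations, an OR,
   and a quadword store. */
void store_byte(uint64_t *mem, uint64_t addr, uint8_t value)
{
    uint64_t quad = mem[addr >> 3];          /* load containing quadword */
    quad = mask_byte(quad, addr);            /* remove the old byte */
    quad |= insert_byte(value, addr);        /* merge in the new byte */
    mem[addr >> 3] = quad;                   /* store the quadword back */
}
```

Note that this sequence is not atomic with respect to other processors; a shared byte would need an interlocked sequence around it.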

Alpha is also unconventional in the approach to arithmetic traps. In
contrast to conventional RISC architectures, Alpha arithmetic traps
(overflow, underflow, etc.) are imprecise -- they can be delivered an
arbitrary number of instructions after the instruction that triggered the
trap, and traps from many different instructions can be reported at once.
This makes implementations that use pipelining and multiple issue
substantially easier to build. 

If precise arithmetic exceptions are desired, trap barrier instructions can
be explicitly inserted in the program to force traps to be delivered at
specific points. 
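
The effect of a trap barrier can be pictured with a toy C model in which arithmetic instructions only record a pending trap summary and the barrier is the point where that summary is delivered. Everything here (TrapState, the flag name, returning the summary instead of actually trapping) is an assumption made for illustration, not a description of any implementation.

```c
#include <stdint.h>

enum { TRAP_INT_OVERFLOW = 1 };

typedef struct {
    unsigned pending;   /* traps raised but not yet delivered */
} TrapState;

/* Checked add: the result is produced immediately, but an overflow
   is only noted in the pending summary (imprecise delivery). */
uint64_t add_checked(TrapState *ts, uint64_t a, uint64_t b)
{
    uint64_t r = a + b;
    if (((a ^ r) & (b ^ r)) >> 63)
        ts->pending |= TRAP_INT_OVERFLOW;
    return r;
}

/* Trap barrier: deliver everything raised so far and clear the
   summary. Traps from several instructions arrive merged together. */
unsigned trap_barrier(TrapState *ts)
{
    unsigned summary = ts->pending;
    ts->pending = 0;
    return summary;
}
```

A compiler that needs precise exceptions for a statement would place the barrier right after that statement's arithmetic; code that does not care omits it and runs at full speed.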

Alpha is also unconventional in the approach to multiprocessor shared
memory. As viewed from a second processor (including an I/O device), a 
sequence of reads and writes issued by one processor may be arbitrarily 
reordered by an implementation. This allows implementations to use 
multi-bank caches, bypassed write buffers, write merging, pipelined writes 
with retry on error, etc. If strict ordering between two accesses must be
maintained, memory barrier instructions can be explicitly inserted in the
program. 
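
The usual case where ordering matters is handing data to another processor through a flag. The sketch below uses C11 atomic fences as a stand-in for memory barrier instructions; this is an analogy chosen for the example, not Alpha code, and the names payload, ready, publish, and try_consume are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static uint64_t   payload;   /* ordinary shared data */
static atomic_int ready;     /* flag watched by the other processor */

/* Writer: the barrier keeps the data write ahead of the flag write. */
void publish(uint64_t value)
{
    payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Reader: the barrier keeps the flag read ahead of the data read. */
bool try_consume(uint64_t *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}
```

Without the two barriers, an implementation that reorders accesses could let the reader see the flag set while still reading a stale payload.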

The basic multiprocessor interlocking primitive is a RISC-style
load_locked, modify, store_conditional sequence. If the sequence runs
without interrupt, exception, or an interfering write from another
processor, then the conditional store succeeds. Otherwise, the store fails
and the program eventually must branch back and retry the sequence. This
style of interlocking scales well with very fast caches, and makes Alpha an
especially attractive architecture for building multiple-processor systems.
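
The retry structure of such a sequence can be shown in portable C. Standard C has no load_locked or store_conditional, so this sketch substitutes a C11 compare-and-exchange loop, which has the same load, modify, attempt-store, retry-on-failure shape. The function name is hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically add 1 to *counter and return the new value. */
uint64_t atomic_increment(_Atomic uint64_t *counter)
{
    /* Plays the role of load_locked: read the current value. */
    uint64_t old = atomic_load_explicit(counter, memory_order_relaxed);

    /* Plays the role of store_conditional: the store succeeds only if
       the location still holds `old`. On failure `old` is refreshed
       with the current value and the sequence is retried. */
    while (!atomic_compare_exchange_weak_explicit(
               counter, &old, old + 1,
               memory_order_acq_rel, memory_order_relaxed)) {
        /* retry with the refreshed value */
    }
    return old + 1;
}
```

The weak form of the exchange is allowed to fail spuriously, much as a conditional store may fail after an interrupt, so the loop is required in either case.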

Alpha includes a number of HINTS for implementations, all aimed at allowing 
higher speed. Calculated jumps have a target hint that can allow much 
faster subroutine calls and returns. There are prefetching hints for the 
memory system that can allow much higher cache hit rates. There are also
granularity hints for the virtual-address mapping that can allow much more 
effective use of translation lookaside buffers for big contiguous 
structures.

Alpha includes a very flexible privileged library of software for operating-
system-specific operations, invoked with PALcalls. This library allows Alpha
to run full VMS using one version of this software library that mirrors many
of the VAX operating-system features, and to run OSF/1 using a different
version that mirrors many of the MIPS operating-system features, and
similarly for NT. Other versions could be tailored for real-time, teaching,
etc. The PALcalls allow Alpha to run VMS with hardly more hardware than
a conventional RISC machine has (the PAL mode bit itself, plus 4 extra
protection bits in each TB entry). This library makes Alpha an especially
attractive architecture for multiple operating systems. 

Finally, Alpha is not strongly biased toward only one or two programming 
languages. It is an attractive architecture for compiling at least a dozen 
different languages.


SUMMARY

Alpha is designed to be a leadership 64-bit architecture.

--------------------
    Specifications (150 MHz version)

    Process Technology           0.75 micron CMOS

    Cycle Time                   150 MHz (6.6 ns)

    Die Size                     13.9mm x 16.8mm

    Transistor Count             1.68 million

    Package                      431 pin PGA

    Number of Signal Pins        291

    Power Dissipation            23 W at 6.6 ns cycle

    Power Supply                 3.3 volts

    Clocking Input               300 MHz differential 

    On-chip D-cache              8 Kbyte, physical, direct-mapped,
                                 write-through, 32-byte line, 32-byte fill

    On-chip I-cache              8 Kbyte, physical, direct-mapped,
                                 32-byte line, 32-byte fill, 64 ASNs

    On-chip DTB                  32-entry, fully-associative; 8-Kbyte,
                                 64-Kbyte, 512-Kbyte, 4-Mbyte page sizes

    On-chip ITB                  8-entry, fully associative, 8-Kbyte page
                                 plus 4-entry, fully-associative, 4-Mbyte page

    Floating Point Unit          On-chip FPU supports both IEEE and VAX
                                 floating point

    Bus                          Separate data and address bus.
                                 128-bit/64-bit data bus

    Serial ROM Interface         Allows the chip to directly
                                 access serial ROM

    Virtual Address Size         64 bits checked; 43 bits
                                 implemented

    Physical Address Size        34 bits implemented

    Page Size                    8 Kbytes

    Issue Rate                   2 instructions per cycle to A-box,
                                 E-box, or F-box

    Integer Pipeline             7-stage pipeline

    Floating Pipeline            10-stage pipeline