ALPHA ARCHITECTURE TECHNICAL SUMMARY

Dick Sites, Rich Witek

[NOTE: "Alpha" is an internal code name. An official name will be announced
soon.]


WHAT IS ALPHA?

Alpha is a 64-bit RISC architecture, designed with particular emphasis on
speed, multiple instruction issue, multiple processors, software migration
from VAX VMS and MIPS ULTRIX, and long lifetime. The architects rejected
any feature that did not appear to be usable for at least 25 years.

The first chip implementation runs at up to 200 MHz. The speed of Alpha
implementations is expected to scale up from this by at least a factor of
1000 over the next 25 years.


FORMATS

Data Formats

Alpha is a load/store RISC architecture with all operations done between
registers. Alpha has 32 integer registers and 32 floating registers, each
64 bits. Integer register R31 and floating register F31 are always zero.
Longword (32-bit) and quadword (64-bit) integers are supported. Four
floating datatypes are supported: VAX F-float, VAX G-float, IEEE single
(32-bit), and IEEE double (64-bit). Memory is accessed via 64-bit virtual
little-endian byte addresses.

Instruction Formats

Alpha instructions are all 32 bits, in four different instruction formats
specifying 0, 1, 2, or 3 register fields. All formats have a 6-bit opcode.

    +-----+-------------------------+
    | OP  |         number          | PALcall
    +-----+----+--------------------+
    | OP  | RA |        disp        | Branch
    +-----+----+----+---------------+
    | OP  | RA | RB |     disp      | Memory
    +-----+----+----+----------+----+
    | OP  | RA | RB |  func.   | RC | Operate
    +-----+----+----+----------+----+

PALcalls specify one of a few dozen complex operations to be performed.

Conditional branches test register RA and specify a signed 21-bit
PC-relative longword target displacement. Subroutine calls put the return
address in RA.
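
As an illustration of the Branch format, the following Python sketch splits
a 32-bit instruction word into its fields and computes a branch target. It
is a minimal sketch under stated assumptions, not part of the architecture
definition: the function names are hypothetical, the opcode is assumed to
occupy the most significant bits (as drawn in the diagram above), the 5-bit
register field width is inferred from the 32-entry register files, and the
displacement is assumed to be relative to the address of the instruction
following the branch.

```python
MASK64 = (1 << 64) - 1  # addresses are 64-bit virtual byte addresses


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_branch(word: int) -> tuple[int, int, int]:
    """Split a 32-bit Branch-format word into (opcode, ra, displacement).

    Assumed layout, most significant bits first:
    6-bit opcode, 5-bit RA, 21-bit signed displacement.
    """
    opcode = (word >> 26) & 0x3F
    ra = (word >> 21) & 0x1F
    disp = sign_extend(word & 0x1FFFFF, 21)
    return opcode, ra, disp


def branch_target(pc: int, disp: int) -> int:
    """Compute the target address for a branch located at `pc`.

    The displacement counts longwords (4 bytes each). Basing it on the
    next instruction (pc + 4) is an assumption of this sketch.
    """
    return (pc + 4 + 4 * disp) & MASK64
```

With these definitions, a displacement of -1 sends a branch at address
0x1000 back to itself, and a displacement of 3 targets 0x1010. Because the
displacement is scaled by the 4-byte instruction size, the 21-bit field
reaches roughly 4 Mbytes in either direction.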
   J Loads and stores move longwords or quadwords between RA and memory, using ; RB plus a signed 16-bit displacement as the memory address.   K Operates use source registers RA and RB, writing result register RC. There  M is an extended opcode in the 11-bit function field. Integer operates can use  @ the RB field and part of the function field to specify an 8-bit  zero-extended literal.   INSTRUCTIONS   PALcall Instructions  J The Privileged Architecture Library call instructions specify one of a fewB dozen complex functions to be performed. These functions deal withD interrupts and exceptions, task switching, virtual memory, and otherE complex operations that must be done atomically. PALcall instructions M vector to a privileged library of software subroutines (using the same Alpha  J instruction set) that implement an operating-system-specific set of these  complex operations.    Branch Instructions   I Conditional branch instructions can test a register for positive/negative H or for zero/nonzero. They can also test integer registers for even/odd. D Unconditional branch instructions can write a return address into a I register. There is also a calculated jump instruction the branches to an  ' arbitrary 64-bit address in a register.    Load/Store Instructions   A Load and store instructions can move either 32- or 64-bit aligned H quantities. The VAX floating-point load/store instructions swap words toG give a consistent register format for floats. Memory addresses are flat I 64-bit virtual addresses, with no segmentation. A 32-bit integer datum is I placed in a register in a canonical form that makes 33 copies of the high F bit of the datum. A 32-bit floating datum is placed in a register in aK canonical form that extends the exponent by 3 bits and extends the fraction I with 29 low-order zeros. 32-bit operates preserve these canonical forms. 
   L There are no 8- or 16-bit load/store instructions, but there are facilities ) for doing byte manipulation in registers.   L Alpha has no 32/64 mode bit or other such device. Compilers, as directed by I user declarations, can generate any mixture of 32- and 64-bit operations.    Integer Operate Instructions  K The integer operate instructions manipulate full 64-bit values, and include ? the usual assortment of arithmetic, compare, logical, and shift J instructions. There are just three 32-bit integer operates: add, subtract,J and multiply. These differ from their 64-bit counterparts ONLY in overflow5 detection and in producing 32-bit canonical results.    ' There is no integer divide instruction.   G In addition to the operations found in conventional RISC architectures, F there are scaled add/subtract for quick subscript calculation, 128-bitB multiply for division by a constant and multiprecision arithmetic,@ conditional moves for avoiding branches, and an extensive set ofK in-register byte manipulation instructions for avoiding single-byte writes.   H Rather then keeping a global state bit for integer overflow trap enable,K the enable is encoded in the function field of each instruction. Thus, both H ADDQ/V and ADDQ opcodes exist for specifying 64-bit add with and without? overflow checking. This makes pipelined implementations easier.   # Floating-point Operate Instructions   G The floating operate instructions include four complete sets of VAX and = IEEE arithmetic, plus conversions between float and integer.    - There is no floating square root instruction.   H In addition to the operations found in conventional RISC architectures, K there are conditional moves for avoiding branches, and merge sign/exponent  + instructions for simple field manipulation.   E Rather then keeping global state bits for arithmetic trap enables and K rounding mode, these enable and mode bits are encoded in the function field  of each instruction. 
     F SIGNIFICANT DIFFERENCES BETWEEN ALPHA AND CONVENTIONAL RISC PROCESSORS  L First, Alpha is a true 64-bit architecture, with a minimal number of 32-bit K instructions. It is not a 32-bit architecture that was later expanded to 64  bits.   H Second, Alpha was designed to allow very high-speed implementations. TheI instructions are very simple (no load-four-registers-unaligned-and-check- E for-bytes-of-zero). There are no special registers that would prevent K pipelining multiple instances of the same operations (no MQ register and no G condition codes). The instructions interact with each other ONLY by one J instruction writing a register or memory, and another one reading from theI same place. This makes it particularly easy to build implementations that F issue multiple instructions every CPU cycle. (The first implementation: in fact issues two instructions every cycle.) There are noI implementation-specific pipeline timing hazards, no load-delay slots, and I no branch-delay slots. These features would make it difficult to maintain E binary compatibility across multiple implementations and difficult to 7 maintain full speed on multiple-issue implementations.     I Alpha is unconventional in the approach to byte manipulation. Single-byte F stores found in conventional RISC architectures force cache and memoryI implementations to include byte shift-and-mask logic, and sequencer logic I to perform read-modify-write on memory words. This approach is awkward to G implement quickly, and tends to slow down cache access to normal 32- or I 64-bit aligned quantities. It also makes it awkward to build a high-speed G error-correcting write-back cache, which is often needed to keep a very H fast RISC implementation busy. It also can make it difficult to pipeline multiple byte operations. 
  J Instead, the byte shifting and masking is done in Alpha with normal 64-bitG register-to-register instructions, crafted to keep the sequences short.   D Alpha is also unconventional in the approach to arithmetic traps. InC contrast to conventional RISC architectures, Alpha arithmetic traps E (overflow, underflow, etc.) are imprecise -- they can be delivered an I arbitrary number of instructions after the instruction that triggered the I trap, and traps from many different instructions can be reported at once. A This makes implementations that use pipelining and multiple issue  substantially easier to build.    K If precise arithmetic exceptions are desired, trap barrier instructions can G be explicitly inserted in the program to force traps to be delivered at  specific points.    E Alpha is also unconventional in the approach to multiprocessor shared G memory. As viewed from a second processor (including an I/O device), a  H sequence of reads and writes issued by one processor may be arbitrarily C reordered by an implementation. This allows implementations to use  K multi-bank caches, bypassed write buffers, write merging, pipelined writes  I with retry on error, etc. If strict ordering between two accesses must be I maintained, memory barrier instructions can be explicitly inserted in the 	 program.    ? The basic multiprocessor interlocking primitive is a RISC-style E load_locked, modify, store_conditional sequence. If the sequence runs B without interrupt, exception, or an interfering write from anotherJ processor, then the conditional store succeeds. Otherwise, the store failsH and the program eventually must branch back and retry the sequence. ThisK style of interlocking scales well with very fast caches, and makes Alpha an K especially attractive architecture for building multiple-processor systems.   L Alpha includes a number of HINTS for implementations, all aimed at allowing F higher speed. 
Calculated jumps have a target hint that can allow much
faster subroutine calls and returns. There are prefetching hints for the
memory system that can allow much higher cache hit rates. There are also
granularity hints for the virtual-address mapping that can allow much more
effective use of translation lookaside buffers for big contiguous
structures.

Alpha includes a very flexible privileged library of software for
operating-system-specific operations, invoked with PALcalls. This library
allows Alpha to run full VMS using one version of this software library
that mirrors many of the VAX operating-system features, and to run OSF/1
using a different version that mirrors many of the MIPS operating-system
features, and similarly for NT. Other versions could be tailored for
real-time, teaching, etc. The PALcalls allow Alpha to run VMS with hardly
more hardware than a conventional RISC machine has (the PAL mode bit
itself, plus 4 extra protection bits in each TB entry). This library makes
Alpha an especially attractive architecture for multiple operating systems.

Finally, Alpha is not strongly biased toward only one or two programming
languages.
It is an attractive architecture for compiling at least a dozen
different languages.


SUMMARY

Alpha is designed to be a leadership 64-bit architecture.

--------------------
    Specifications (150 MHz version).

    Process Technology           0.75 micron CMOS
    Cycle Time                   6.6 ns (150 MHz)
    Die Size                     13.9 mm x 16.8 mm
    Transistor Count             1.68 million
    Package                      431-pin PGA
    Number of Signal Pins        291
    Power Dissipation            23 W at 6.6 ns cycle
    Power Supply                 3.3 volts
    Clocking Input               300 MHz differential
    On-chip D-cache              8 Kbyte, physical, direct-mapped,
                                 write-through, 32-byte line, 32-byte fill
    On-chip I-cache              8 Kbyte, physical, direct-mapped,
                                 32-byte line, 32-byte fill, 64 ASNs
    On-chip DTB                  32-entry, fully associative; 8-Kbyte,
                                 64-Kbyte, 256-Kbyte, 4-Mbyte page sizes
    On-chip ITB                  8-entry, fully associative, 8-Kbyte page
                                 plus 4-entry, fully associative, 4-Mbyte page
    Floating Point Unit          On-chip FPU supports both IEEE and VAX
                                 floating point
    Bus                          Separate data and address bus.
                                 128-bit/64-bit data bus
    Serial ROM Interface         Allows the chip to directly
                                 access serial ROM
    Virtual Address Size         64 bits checked; 43 bits
                                 implemented
    Physical Address Size        34 bits implemented
    Page Size                    8 Kbytes
    Issue Rate                   2 instructions per cycle to A-box,
                                 E-box, or F-box
    Integer Pipeline             7-stage pipeline
    Floating Pipeline            10-stage pipeline