Friday, April 5, 2002

Design of a Simple RISC Processor

Fadi Yared and I built this for our Elec 374, Digital Systems Engineering course at Queen's University.

Objective

The purpose of the project is to design and implement a fully functional processor with an assortment of common processor instructions. The design is completed and simulated using the Altera Max+PlusII CAD software system, and the processor (called Mini SRC) is implemented on a FLEX 10k FPGA chip. The instruction set is RISC-style and is scaled down to two general-purpose registers named R0 and R1. A further objective is to create a 3-bus architecture, with each bus 8 bits wide, in the hope of reducing the number of cycles per instruction (CPI) and improving system performance. Beyond the common ALU and memory access instructions, the processor also supports subroutine calls, a system stack, and interrupts. For a detailed description of the Mini SRC instruction set, see the opcode specifications included in this report.

Designing a Simple Processor (Mini SRC):
- design, simulate, implement, and verify a small processor.
- design is to be made using the Altera Max+PlusII CAD software system.
- processor should be implemented on the FLEX 10k FPGA chip.

Properties of the Design:
- 8-bit machine
- two general-purpose registers named R0 and R1
- 8-bit data paths
- minimum goal is a 1-bus architecture
- capable of addressing up to 256 bytes of memory
- all instructions are 8 bits long
- Arithmetic and Logic Unit (ALU) that performs 5 operations: Add, Subtract, Increment by 1, Shift Right 1 bit, and Logical AND
- support 12 instructions: Load, Store, Load Immediate/extended, Store extended, Add, Subtract, Branch, Shift, AND, No-operation, and Stop.
- instructions encoded into a 4-bit field at the higher-order end of an instruction

Processor State:
- PC<7..0>: 8-bit register named Program Counter (PC)
- IR<7..0>: 8-bit register named Instruction Register (IR)
- R[0..1]<7..0>: two 8-bit general purpose registers named R[0] and R[1]
- Run: 1-bit run/halt indicator
- Start: Start signal
- Reset: Reset signal

Memory State:
- M[0..255]<7..0>: 256 1-byte words of memory

Additional Features:
- multi-bus architecture
- new instructions (NEG, OR, INPUT, OUTPUT, CALL, RETURN, etc.)
- stack
- support for interrupt handling

Phases of Design:
- Design and test the Data Path and the ALU using Functional Simulation.
- Add logic for selecting R0 and R1 from the ra, rb, rc fields in the instructions and add logic for evaluating whether or not to follow a branch. Also implement the memory interface design. Test using Functional Simulation.
- Design and test the Control Unit using Functional Simulation.
- Integrate the Data Path and Control Unit into a single design and test it using both Functional Simulation and Timing Simulation for an implementation in a FLEX 10k FPGA chip.

Instruction Set and Opcode Specification

In many opcodes, bit 0 is used to distinguish between two actions such as Add and Subtract. The notation 1/0 is used; for example, a/s means that 1 indicates Add and 0 indicates Subtract.

Bits 7 to 4 are the opcodes proper. This value is also shown at the top right of each opcode description in both binary and hex.

Fields: ra, rb, rc each indicate a register. 0 indicates R0, 1 indicates R1.

"-" indicates that the field is unused.

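The field layout above can be illustrated with a small decoder. This is only a sketch in Python, not part of the original design; the function name `decode` and the dictionary keys are made up for illustration.

```python
def decode(byte):
    """Split an 8-bit Mini SRC instruction into the fields described above."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("instruction must fit in 8 bits")
    return {
        "opcode": (byte >> 4) & 0xF,  # bits 7..4: the opcode proper
        "ra": (byte >> 3) & 1,        # bit 3: 0 = R0, 1 = R1
        "rb": (byte >> 2) & 1,        # bit 2
        "rc": (byte >> 1) & 1,        # bit 1
        "c2": byte & 0b11,            # bits 1..0: raw 2-bit offset / condition field
        "flag": byte & 1,             # bit 0: the 1/0 selector (a/s, i/x, ...)
    }


fields = decode(0x43)  # 0100 0011
```

With the encoding given in this spec, 0x43 has opcode 4 (Add / Subtract), ra=0, rb=0, rc=1 and bit 0 set, so it reads as R0 ← R0 + R1. Which of `rb`, `rc`, `c2` and `flag` is meaningful depends on the opcode.
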
Load                 (0000b) (0h)

   7   6   5   4   3   2   1   0
   0   0   0   0  ra  rb    c2

If rb=1 then         // Indexed/Indirect
  R[ra] ← M[R1+c2]
Else                 // rb=0: Direct
  R[ra] ← M[c2]
End If

c2 is a sign-extended 2's complement number,
i.e. it can have the values +1, 0, -1, -2.

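As a rough illustration of how the 2-bit c2 field turns into an address, here is a Python sketch. The helper names are hypothetical, and the wrap-around to 8 bits (so that -1 becomes address FF in direct mode) is an assumption based on the 8-bit data path rather than something stated in the spec.

```python
def sign_extend_c2(c2):
    """Sign-extend the 2-bit field c2 (bits 1..0) to an integer in -2..+1."""
    c2 &= 0b11
    return c2 - 4 if c2 & 0b10 else c2


def effective_address(rb, c2, r1):
    """Address used by Load/Store: R1 + c2 when rb=1, otherwise c2 alone.

    The result is wrapped to 8 bits (assumption: addresses are modulo 256).
    """
    base = r1 if rb == 1 else 0
    return (base + sign_extend_c2(c2)) & 0xFF
```

For example, `effective_address(1, 0b10, 0x50)` gives 0x4E (R1 - 2), and under the wrap-around assumption `effective_address(0, 0b11, 0)` gives 0xFF.
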
Store                (0001b) (1h)

   7   6   5   4   3   2   1   0
   0   0   0   1  ra  rb    c2

If rb=1 then         // Indexed/Indirect
  M[R1+c2] ← R[ra]
Else                 // rb=0: Direct
  M[c2] ← R[ra]
End If

c2 is a sign-extended 2's complement number,
i.e. it can have the values +1, 0, -1, -2.

Load Immediate / Load Extended (0010b) (2h)

   7   6   5   4   3   2   1   0
   0   0   1   0  ra   -   -  i/x

If i/x=1 then        // Immediate
  R[ra] ← M[PC+1]
Else                 // i/x=0: Extended
  R[ra] ← M[M[PC+1]]
End If

Store Extended       (0011b) (3h)

   7   6   5   4   3   2   1   0
   0   0   1   1  ra   -   -   -

M[M[PC+1]] ← R[ra]

Add / Subtract       (0100b) (4h)

   7   6   5   4   3   2   1   0
   0   1   0   0  ra  rb  rc  a/s

If a/s=1 then            // Add
  R[ra] ← R[rb] + R[rc]
Else                     // a/s=0: Subtract
  R[ra] ← R[rb] - R[rc]
End If

Addition and subtraction are in 2's complement.

Enable IRQ / Disable IRQ (0101b) (5h)

   7   6   5   4   3   2   1   0
   0   1   0   1  ra   -   -  e/d

// Enable IRQ and set value of Period Register
If e/d=1 then
  // IRQ is enabled
  Period ← R[ra]
Else                 // e/d=0
  // IRQ is disabled
End If

And / Or             (0110b) (6h)

   7   6   5   4   3   2   1   0
   0   1   1   0  ra  rb  rc  a/o

If a/o=1 then              // Logical Bitwise And
  R[ra] ← R[rb] and R[rc]
Else                       // a/o=0: Logical Bitwise Or
  R[ra] ← R[rb] or R[rc]
End If

Branch               (0111b) (7h)

   7   6   5   4   3   2   1   0
   0   1   1   1  ra  rb     c

PC ← R[ra] if R[rb] meets the condition c

c = 00   Always   Branch always.
c = 01   Zero     Branch if the contents of R[rb] are zero.
c = 10   Nonzero  Branch if the contents of R[rb] are nonzero.
c = 11   Minus    Branch if the contents of R[rb] are negative.

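The four branch conditions can be written out as a short Python sketch. The function names are hypothetical. Two assumptions are made here: "negative" means bit 7 of R[rb] is set (2's complement), and a branch that is not taken simply falls through to PC + 1, since Branch is a single-byte instruction.

```python
def branch_taken(c, rb_value):
    """Return True if condition c holds for the 8-bit value held in R[rb]."""
    rb_value &= 0xFF
    if c == 0b00:
        return True                   # Always
    if c == 0b01:
        return rb_value == 0          # Zero
    if c == 0b10:
        return rb_value != 0          # Nonzero
    if c == 0b11:
        return bool(rb_value & 0x80)  # Minus: sign bit set (assumed meaning)
    raise ValueError("c is a 2-bit field")


def next_pc(pc, ra_value, c, rb_value):
    """PC <- R[ra] when the condition holds, else fall through (assumed PC + 1)."""
    if branch_taken(c, rb_value):
        return ra_value & 0xFF
    return (pc + 1) & 0xFF
```

For instance, `next_pc(0x10, 0x55, 0b10, 0x01)` returns 0x55 because R[rb] is nonzero.
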
Shift Right / Shift Left (1000b) (8h)

   7   6   5   4   3   2   1   0
   1   0   0   0  ra  rb   c  r/l

If r/l=1 then             // Logical Shift Right by 1 bit
  R[ra] ← 0 # R[rb]<7..1>
Else                      // r/l=0: Logical Shift Left by 1 bit
  R[ra] ← R[rb]<6..0> # 0
End If

# means concatenate
R[rb]<x..y> means bits x to y of R[rb]

No Operation         (1001b) (9h)

   7   6   5   4   3   2   1   0
   1   0   0   1   -   -   -   -

Waste one cycle.

Stop                 (1010b) (Ah)

   7   6   5   4   3   2   1   0
   1   0   1   0   -   -   -   -

Stop processing instructions.

Return From ISR      (1011b) (Bh)

   7   6   5   4   3   2   1   0
   1   0   1   1   -   -   -   -

R1 ← M[SP + 1]
R0 ← M[SP + 2]
PC ← M[SP + 3]

When an IRQ is received, the system automatically stacks PC, R0 and R1, and disables IRQ.
Hence, when the Return from ISR opcode is read, these are unstacked in reverse order.

Negate               (1100b) (Ch)

   7   6   5   4   3   2   1   0
   1   1   0   0  ra  rb   -   -

R[ra] ← not(R[rb])
Where not(x) is a bitwise negation of x.

Increment / Decrement (1101b) (Dh)

   7   6   5   4   3   2   1   0
   1   1   0   1  ra  rb   -  i/d

If i/d=1 then        // Unsigned Increment
  R[ra] ← R[rb] + 1
Else                 // i/d=0: Unsigned Decrement
  R[ra] ← R[rb] - 1
End If

Numbers are considered as unsigned numbers, hence FF-01=FE

Call Sub / Return From Sub (1110b) (Eh)

   7   6   5   4   3   2   1   0
   1   1   1   0  ra   -   -  c/r

If c/r=1 then     // Call Subroutine
  PC ← R[ra]
Else              // c/r=0: Return From Subroutine
  PC ← M[SP + 1]
End If

When calling a subroutine, the system automatically stacks only the PC.
SP points to the next empty cell in memory.
The stack grows downwards in memory, i.e. FF then FE then FD etc.
SP is automatically initialized to FF.

Push / Pull          (1111b) (Fh)

   7   6   5   4   3   2   1   0
   1   1   1   1  ra   -   -  h/l

If h/l=1 then     // Push onto Stack
  M[SP] ← R[ra]
Else              // h/l=0: Pull from Stack
  R[ra] ← M[SP + 1]
End If

When calling a subroutine, the system automatically stacks only the PC.
SP points to the next empty cell in memory.
The stack grows downwards in memory, i.e. FF then FE then FD etc.
SP is automatically initialized to FF.
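The stack convention can be modelled in a few lines of Python. This is a sketch only: the class name `Stack` is hypothetical, and the SP updates (decrement after a push, increment before a pull) are inferred from "SP points to the next empty cell" and "the stack grows downwards" rather than spelled out in the opcode spec.

```python
class Stack:
    """Toy model of the Mini SRC system stack living in the 256-byte memory."""

    def __init__(self):
        self.mem = [0] * 256
        self.sp = 0xFF                      # SP is initialized to FF

    def push(self, value):
        self.mem[self.sp] = value & 0xFF    # M[SP] <- R[ra]
        self.sp = (self.sp - 1) & 0xFF      # assumed: SP moves down to the next empty cell

    def pull(self):
        self.sp = (self.sp + 1) & 0xFF      # assumed: SP moves back up first
        return self.mem[self.sp]            # R[ra] <- M[old SP + 1]


s = Stack()
s.push(0x12)
s.push(0x34)
top = s.pull()
```

After the two pushes the values sit at FF and FE and SP is FD; the pull then returns the most recently pushed value and leaves SP at FE.
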

Data Path (The overall interconnection of components)

The data path connects the main components of the system. The memory interface is also specified in this schematic. The design uses a three-bus system, labeled A, B and C. The register set takes data input from all three of these busses, and nowhere else. Values from registers and RAM can be put only on Busses A and B. Bus C is the output of the ALU. System registers are synchronous to the falling edge of the clock, but the Control Unit is synchronous to the rising edge of the clock. Hence, control signals are generated half a cycle before registers clock in new values. Busses A and B and the ALU are asynchronous, so they respond to control signals immediately. As a result, values can be passed between registers and/or through the ALU in one clock cycle, without the need to latch intermediate values.

Data Path

An example of a one-cycle sequence is addition. At the rising edge of the cycle, control signals are generated telling Bus A to carry the value of R0, telling Bus B to carry the value of R1, telling the ALU to add its two inputs and telling R0 to input the value from Bus C. The asynchronous busses and ALU respond immediately and, in a very short time, the result of the addition propagates through to Bus C. On the falling edge of the cycle, R0 clocks in the value on Bus C, which is the proper result.

The RAM is synchronous to the rising edge of the clock. It communicates with the system through registers MD and MA and Bus A. By default, RAM is set to read (i.e. memWrite=0); therefore, on every clock cycle, it outputs the value at the address specified in MA.

A memory read can be accomplished by generating control signals at the rising edge of clock 1 such that MA latches the address at the falling edge of clock 1. RAM will then output the requested data at the rising edge of clock 2, and that data can be latched into the desired register on the falling edge of clock 2. Therefore a read takes 2 cycles.

A memory write can be accomplished by generating control signals at the rising edge of clock 1 such that MA latches the address and MD latches the data to be written at the falling edge of clock 1. RAM will then write the specified data to the specified address at the rising edge of clock 2 (i.e. the end of clock 1). Therefore a write takes 1 cycle.

Special logic was necessary for the memWrite control signal. Because control signals change on the rising edge, memWrite would be de-asserted just as RAM (also rising edge) was supposed to read it. It was, consequently, necessary to add a flip flop to delay this signal by half a clock cycle. This delay flip flop is labeled on the schematic.

For Interrupt Service Routine support, the user must be able to specify the address of their ISR. This system requires them to write the address of their ISR to memory location E0. The system consequently needs to be able to read specifically from that address, hence the constant E0 and MA are multiplexed to the Address input of the RAM.

Control Unit (Generation of all system control signals)

An external decoder translates the four-bit opcode from IR[7..4] into 16 signals, one for each opcode. The inputs to the Control Unit are:
- the above 16 signals
- IR[0] and IR[2], used for distinguishing between actions within a given opcode. For example, Add and Sub have the same opcode and IR[0] distinguishes them.
- CON which is the result of decision logic for branching.
- IRQ which is the interrupt service request signal.
- Clock which is the system clock.
- Reset which resets the system.

Control Unit

The control unit uses these inputs to generate all system control signals. It is written as a single process containing one conditional: if the reset signal is received, the unit goes to the reset state; otherwise, on a rising clock edge, it clears all control signals, then sets the control signals for its current state and determines its next state.

The Reset State: (Rset)

In this state, the system zeros all registers, except for SP which it sets to FF. It also initializes run, and disables IRQ. It sets the next state to T0 which is the beginning of Opcode Fetch. Hence, all programs must start at address 00 because that address is expected to hold the first opcode after reset.

Opcode Fetch: (states T0 and T1)

Every operation except the servicing of an ISR begins with the opcode fetch carried out in states T0 and T1. It is only in state T0 that the system checks the IRQ signal. If there is an Interrupt Service Request then the system stacks the system state and services the interrupt. Otherwise it fetches the next opcode and carries out the instruction sequence to completion.

State T2 is the most complex because it is in this state that the control unit considers the opcode held in IR. Most instructions are completed in this state and set the next state to T0. Some instructions require more clock cycles and, consequently, have extra states.

In any state, the control unit first sets the appropriate signals, then sets the next state. It may make a decision based on the contents of the IR as to what signals to set or which state should be next, but that is often inherent to the state and therefore no decisions are necessary.
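The state sequencing described above can be sketched as a next-state function. The real control unit is a VHDL process; the Python below is only an illustration. The state names T3 and T4 are hypothetical (the text only names Rset, T0, T1 and T2), the cycle counts are taken from the Results table later in this report, and IRQ checking in T0 is left out.

```python
# Cycle counts for a few instructions, taken from the Results table.
CYCLES = {"add": 3, "nop": 3, "ldi": 4, "ldx": 5}


def next_state(state, instr, reset=False):
    """Return the state that follows `state` while executing `instr`."""
    if reset:
        return "Rset"                 # reset overrides everything
    if state == "Rset":
        return "T0"                   # reset leads into opcode fetch
    n = int(state[1:])                # "T2" -> 2
    last = CYCLES[instr] - 1          # index of the final state for this instruction
    return "T0" if n >= last else f"T{n + 1}"


def trace(instr):
    """List the states visited by one instruction, starting at T0."""
    states, s = ["T0"], "T0"
    while True:
        s = next_state(s, instr)
        if s == "T0":
            return states
        states.append(s)
```

Under these assumptions `trace("add")` visits T0, T1, T2, matching the statement that most instructions complete in T2, while `trace("ldx")` needs two extra states.
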

Grx Logic (Generation of control signals for R0 & R1)

The Grxlogic module for the mini SRC is responsible for generating the control signals for R0 and R1. More specifically, this module is responsible for interfacing the Instruction Register fields ra, rb, rc with the actual registers. This is done through a grouping of 'sum of products' logic gates. Inputs to the Grxlogic module are IR fields [3..1] and the Grx signals, which are generated by the Control Unit. The outputs of the Grxlogic module are the control signals for the desired register to be activated.
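One way to picture the sum-of-products selection is the Python sketch below. The signal names `gra`, `grb` and `grc` (one per IR field) and the two outputs are assumptions made for illustration; the actual Grx signal names and the number of outputs are only visible on the schematic.

```python
def grx_select(ir, gra=0, grb=0, grc=0):
    """Return (r0_select, r1_select) for instruction byte `ir`.

    gra/grb/grc are hypothetical one-bit control signals saying which
    IR field (ra = bit 3, rb = bit 2, rc = bit 1) picks the register.
    """
    ra, rb, rc = (ir >> 3) & 1, (ir >> 2) & 1, (ir >> 1) & 1
    # Sum of products: a register is selected when an asserted Grx signal
    # lines up with a field naming that register (0 = R0, 1 = R1).
    r0 = (gra & (ra ^ 1)) | (grb & (rb ^ 1)) | (grc & (rc ^ 1))
    r1 = (gra & ra) | (grb & rb) | (grc & rc)
    return r0, r1
```

For the instruction byte 0x43 (ra=0, rb=0, rc=1), asserting `grc` selects R1 and asserting `gra` selects R0; with no Grx signal asserted, neither register is selected.
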

Branch Logic

This module is responsible for the interface between the IR branch fields and the control unit. It interprets the appropriate field within the instruction register and determines whether the branch condition is met. If it is, a CON signal is asserted and is processed by the control unit. The inputs to this module are IR fields [1..0], BUSA, clock, CONclear, and CONin. The output is simply the CON signal, which is sent to the control unit.

Register Set

The register set is the collection of registers that are available in the system. For simplicity they were grouped together within a module. Some of the registers within the module have the ability to accept input from multiple buses (reg2input). The inputs to the Register Set module are the control signals which enable the specific registers, the 3 Bus lines (A, B, C), clock, and clear. The outputs of the Register Set module are the outputs of the individual registers within the module.

ALU (Arithmetic and Logic Unit)

The ALU can perform 11 functions:
- Logical Bit-wise OR of Bus A and Bus B
- Logical Bit-wise AND of Bus A and Bus B
- Logical Bit-wise NEGATE of Bus B
- Logical Shift Right (by one bit) of Bus B
- Logical Shift Left (by one bit) of Bus B
- Addition: Bus B + Bus A
- Subtraction: Bus B – Bus A
- Increment Bus B by 1
- Decrement Bus B by 1
- Sign Extend IR[1..0]
- Addition: Bus A + Sign Extended IR[1..0]

Most of the implementation is straightforward and apparent from the schematic; however, the interface to the adder/subtracter is somewhat complicated. Each of its inputs is fed through a two-input multiplexer. This allows us to consider four useful combinations:
- Bus A and Bus B. This is useful for addition and subtraction.
- Bus B and Zero. This is useful for incrementing and decrementing.
- Sign Extended IR[1..0] and Zero. This is necessary for Direct Load and Store instructions.
- Sign Extended IR[1..0] and Bus A. This is necessary for Indexed/Indirect Load and Store instructions.
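The ALU functions listed above can be summarised as a behavioural Python sketch. It models only the 8-bit results, not the multiplexer structure of the schematic, and the operation names passed in `op` are made up for this example.

```python
def alu(op, a=0, b=0, ir=0):
    """Behavioural model: a = Bus A, b = Bus B, ir = instruction byte; returns Bus C."""
    a &= 0xFF
    b &= 0xFF
    c2 = ir & 0b11
    sext = c2 - 4 if c2 & 0b10 else c2   # sign-extended IR[1..0], in -2..+1
    results = {
        "or": a | b,
        "and": a & b,
        "not": ~b,            # bitwise negate of Bus B
        "shr": b >> 1,        # logical shift right by one bit
        "shl": b << 1,        # logical shift left by one bit
        "add": b + a,
        "sub": b - a,         # Bus B - Bus A
        "inc": b + 1,
        "dec": b - 1,
        "sext": sext,
        "add_sext": a + sext,
    }
    return results[op] & 0xFF  # every result is truncated to 8 bits
```

For example, `alu("sub", a=0x01, b=0xAB)` returns 0xAA and `alu("dec", b=0xFF)` returns 0xFE, matching the unsigned FF-01=FE note in the opcode spec.
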

IRQ (Interrupt Support)

The mini SRC processor has a timer interrupt system. The interrupt signal is generated by a free-running 8-bit counter which is clocked by the processor clock. Configuring the system for interrupt support is done through the use of reserved opcodes that have been hardwired into the system.

As the mini SRC clock runs, a free-running 8-bit counter is incremented by 1 with each clock cycle. When the counter reaches FF it simply rolls over and begins counting again at 00. With each clock cycle the value in the free-running counter is compared to a user-specified 8-bit value that is stored within a register. The comparison is done with a simple 8-bit compare circuit available in the MEGA_LPM package. If the comparison yields a match, and the user has enabled interrupts, an IRQ signal is generated by the interrupt circuitry and the control unit begins to process the interrupt service routine (ISR). By design, the address of the ISR is stored at location E0 in memory. This memory address is reserved for the ISR jump vector, and the user should take care that E0 contains the desired starting address for an ISR. When the system breaks to process an ISR it stores the current state of the system (i.e. the register values and the program counter) onto the system stack and begins processing the ISR at the user-specified address. While the system is processing the ISR, it ignores interrupts so that processing can complete. It is the user's responsibility to ensure the Return from ISR opcode is placed at the end of the ISR. Once the control unit detects the Return from ISR opcode, it generates the appropriate signals that pull the system state information from the stack back into the appropriate registers.

Configuring interrupts takes three steps:

1 - Store the desired ISR starting address (jump vector) at address E0 in memory. Remember that E0 is a RESERVED address and should contain the starting address of the ISR if interrupt functionality is desired.

2 - Enable interrupts using the Enable IRQ opcode (see opcode spec). The opcode requires the user to specify the register (R0 or R1) that holds the interrupt period. The value in the specified register corresponds to the clock number at which to generate an interrupt.

3 - At the end of the interrupt service routine, use the Return from ISR opcode (see opcode spec) to complete the process.
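The counter-and-compare behaviour can be sketched in Python as follows. This is a simplified model, not the actual circuit: it assumes the counter starts at 00 after reset, the function name `irq_cycles` is hypothetical, and it ignores both the fact that IRQ is only sampled in state T0 and that interrupts are masked while an ISR runs.

```python
def irq_cycles(period, enabled, n_cycles):
    """Return the clock cycle numbers (counting from 0) on which IRQ is raised."""
    fired = []
    counter = 0                          # assumed to start at 00 after reset
    for cycle in range(n_cycles):
        # 8-bit compare of the free-running counter against the Period register
        if enabled and counter == (period & 0xFF):
            fired.append(cycle)
        counter = (counter + 1) & 0xFF   # FF rolls over to 00
    return fired
```

With a period of 22 hex, the model raises IRQ when the counter reads 34, and again 256 cycles later once the counter has rolled over: `irq_cycles(0x22, True, 300)` gives `[34, 290]`.
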

Example Using Interrupts

The interrupt is set to fire after 34 clock cycles (22 hex). It was chosen that this interrupt would happen during a sequence of NOP instructions. The user specifies the ISR starting address to be D0. Finally, within the ISR, registers R0 and R1 are both loaded with 99 (hex) to demonstrate that the ISR is being serviced.

DEPTH = 256; % Memory depth and width are required %
WIDTH = 8;   % Enter a decimal number %

ADDRESS_RADIX = HEX;   
DATA_RADIX = HEX;          

CONTENT
     BEGIN

     01        :    21;       % ldi , R0 <= 22 %
     02        :    22;       % Fire interrupt after 22(hex) clock cycles %
     03        :    29;       % ldi R1<= D0 %
     04        :    D0;       % ISR address D0 %
     05        :    38;       % stx E0 <= R1 (D0) %
     06        :    E0;       % Write D0 to reserved address E0 %
     07        :    51;       % enable interrupts, get interrupt period (22) from R0 %
     08        :    28;       % Ldx R1 %
     09        :    03;       % addr %
     0A        :    90;       % NOP %
     0B        :    90;       % NOP  <--- Expect interrupt to fire in here %
     0C        :    90;       % NOP %
     0D        :    90;       % NOP %
     0E        :    42;       % Add: R0 = R0 + R1 = 43%

% THE ISR %

     D0       :    21;
     D1       :    99;
     D2       :    29;
     D3       :    99;
     D4       :    B0; %return from interrupt%

END ;

The following is a simulation of the above program. Note that the IRQ signal is a pulse that occurs after 34 (22 hex) clock cycles, somewhere within the NOP instruction sequence (NOP is opcode 90).

Interrupt Simulation

Results

Instruction                 Number of Cycles

Service Interrupt Request   7
load                        4
store                       4
ldi                         4
ldx                         5
stx                         4
add                         3
sub                         3
enable irq                  3
disable irq                 4
and                         3
or                          3
branch                      4
shift R                     3
shift L                     3
NOP                         3
STOP                        3
Return from ISR             8
Neg                         3
INC                         3
DEC                         3
Call Subroutine             4
Return from Subroutine      4
Push                        4
Pull                        4

Maximum frequency of operation in the simulator:  15 MHz
Maximum frequency of operation on the chip:       15 MHz
Average Cycles Per Instruction (CPI):       96 / 25 = 3.84
MIPS rating (for 15 MHz physical test run): 15 MHz / 3.84 CPI = 3.9 MIPS
Memory utilized: 16%
LCs utilized:    40%
# of LCs:        464
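The CPI and MIPS figures follow directly from the cycle table. The short Python check below reproduces the arithmetic; the dictionary keys are just labels copied from the table. Note that this CPI is a plain average over the 25 table rows, not an average weighted by how often each instruction appears in a real program.

```python
# Cycle counts copied from the Results table (25 rows).
CYCLES = {
    "Service Interrupt Request": 7, "load": 4, "store": 4, "ldi": 4, "ldx": 5,
    "stx": 4, "add": 3, "sub": 3, "enable irq": 3, "disable irq": 4,
    "and": 3, "or": 3, "branch": 4, "shift R": 3, "shift L": 3,
    "NOP": 3, "STOP": 3, "Return from ISR": 8, "Neg": 3, "INC": 3,
    "DEC": 3, "Call Subroutine": 4, "Return from Subroutine": 4,
    "Push": 4, "Pull": 4,
}

total_cycles = sum(CYCLES.values())      # 96
cpi = total_cycles / len(CYCLES)         # 96 / 25
clock_mhz = 15
mips = clock_mhz / cpi                   # million instructions per second

print(f"CPI = {cpi:.2f}, MIPS = {mips:.1f}")
```
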

Evolution of the Design

The objective of Phase 1 was to design and implement the Data path for the mini SRC. This data path consisted of a 1-BUS 8 bit wide system. After completion of the 1-BUS system a 3-BUS implementation would be straightforward. In anticipation of a switch to a 3-BUS system the registers were all placed into a single Design schematic that would lend itself to modularity.

The Bus was implemented using a simple 8-input multiplexer with 3 selects. Some of the components along the data path were not implemented within the first phase, and thus the signals they would provide were manually generated during simulation. The final product at the end of the first phase was a functional data path along with limited ALU functionality. All opcode decoding was implemented in a temporary fashion which would suffice for the testing and simulation of the data path.

During Phase 2 much of the instruction decoding logic was added to the mini SRC. Outputs from the instruction register were passed through combinational logic so that opcodes could be decoded. Furthermore, the memory interface design was added to the data path to allow for memory accesses. The type of RAM used was synchronous RAM with 256 accessible memory locations, each containing 8 bits of data. Further modification was made to the ALU design in order to incorporate the instruction-decoding scheme employed. At the end of the phase, the memory access instructions, ALU instructions and Branch instruction were all tested and simulated for expected results.

Memory Interface

Register Select Decoding Logic

The major task of Phase 3 was the addition of a control unit. The control unit would contain the cycle-by-cycle instruction sequence of all opcodes. Implementation of the control unit was done through a finite state machine that was written in VHDL. After the control unit was written in VHDL it was placed within the data path, tested and simulated. Furthermore, the entire processor was tested by writing a small program into RAM and verifying simulation results.

The major upgrades to the processor in the final phase were as follows: 3-BUS implementation, system stack support, subroutine call support, and finally interrupt support. The move to a 3-BUS architecture required a thorough reassessment of many system components. Major changes had to be applied to the control unit, since the additional Bus support meant that some cycles in certain instructions could be eliminated. Furthermore, every instruction sequence in the control unit had to be re-coded to incorporate multi-Bus support. Components such as Temp registers A and C were removed since they were no longer needed; operations could be performed in parallel on the additional Buses. After all modifications and upgrades were complete, a full timing and functional simulation was performed, and the processor was then uploaded onto a FLEX EPF10K20 chip and tested from 1 to 15 MHz. A program that had been written into memory was verified by observing the values that appeared on the LEDs of the Altera evaluation board.

Test

The following program tests all instructions except the interrupt instructions discussed above.

ORG 00

  ldi  R0, $67 ;       R0 = $67
  ldx  R1, $80 ;       R1 = ($80) = $44
  add  R0, R0, R1 ;    R0 = $AB
  and  R1, R0, R1 ;    R1 = 00
  ldx  R1, $81 ;       R1 = ($81) = 01
  sub  R0, R0, R1 ;    R0 = $AB – 1 = $AA
  shr  R0, R0, 0 ;     R0 = $55
  stx  R0, $80 ;       ($80) = $55
  bral R0, R1 ;        PC = $55

ORG $55

  loop nop
  brnz R0, R1 ;        branch to loop forever
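The register values in the comments of the listing can be checked by hand. The Python sketch below replays the arithmetic only; it is not a simulator of the processor. The initial memory contents ($80) = $44 and ($81) = $01 are taken from the comments in the listing.

```python
mem = {0x80: 0x44, 0x81: 0x01}   # initial data implied by the listing's comments

r0 = 0x67                        # ldi  R0, $67
r1 = mem[0x80]                   # ldx  R1, $80
r0 = (r0 + r1) & 0xFF            # add  R0, R0, R1
after_add = r0
r1 = r0 & r1                     # and  R1, R0, R1
after_and = r1
r1 = mem[0x81]                   # ldx  R1, $81
r0 = (r0 - r1) & 0xFF            # sub  R0, R0, R1
after_sub = r0
r0 >>= 1                         # shr  R0, R0, 0
mem[0x80] = r0                   # stx  R0, $80
pc = r0                          # bral R0, R1 (branch always to the address in R0)

print(hex(after_add), hex(after_and), hex(after_sub), hex(r0), hex(pc))
```

The intermediate values agree with the comments: $AB after the add, 00 after the and, $AA after the sub, and $55 after the shift, which is also where the program branches.
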

Extra Features

In addition to the functionality required by the official specification sheet, the features below were added. Note that even though load and store were not necessary (in light of Ldi, Ldx, Stx), they also have been implemented.

1- Load – loads from 1 of 4 specified addresses or from address + offset
2- Store - stores to 1 of 4 specified addresses or to address + offset
3- OR – Bit-wise logical OR of specified values.
4- ShiftL – shifts specified register value to the left.
5- NEG – negates specified register value
6- INC / DEC – increment or decrement specified register value by 1
7- 3-Bus architecture
8- Timer Interrupt – user can enable/disable interrupts to occur at a given clock #
9- Stack (push , pull) – user has access to a Stack
10- Subroutines (Call Routine, Return ) – user has ability to jump to a specified routine and return from it.

Conclusion

After completing the mini SRC it was found that the most significant improvement in performance was gained by implementing a 3-Bus architecture. This greatly reduced our CPI and resulted in performance increases. Unfortunately, the 3-Bus architecture also introduced some timing issues which may have reduced our maximum frequency of operation.

The mini SRC that was produced is capable of performing many common instructions that are typically required by processors used in small embedded systems. The addition of interrupt handling opens the mini SRC to even more applications, such as timers and more complicated user-feedback systems.

Future work

Examples of future work that can be undertaken for the mini SRC include:
1 - The addition of Input and Output ports with handshaking
2 - Different types of interrupts (input compare / output compare)
3 - 16-bit operand support instead of 8-bit
4 - Additional ALU operations such as Multiply / Divide
5 - A Condition Code register which sets bits after register contents have changed (condition bits: Negative, Overflow, Zero, Non-Zero)
