Week15

The Ultimate Guide to Project06

This guide explains how to develop a solution to Project06. Use it in conjunction with the class lecture notes.

General Advice

  • This is a complex project, but very doable if you put in the time.
  • You should try to develop a solution on your own with limited help from classmates. You want to do it yourself so that you understand how everything works and so you can answer questions on the final exam. Of course, for specific questions, please ask me or the TAs in person or on Piazza. My point here is that you need to struggle and solve the hard parts yourself to get the most out of this project.
  • You can follow the design provided in the slides I linked to on the Project06 page, but I encourage you to develop your processor on your own, by building your own data path, control path, and control signals. In particular, the control unit from the slides is somewhat complex, so I propose a simpler design in my lecture notes and in this guide. The approach I describe here and in class allows you to develop your processor incrementally. This way you can get simple machine code programs working first, then evolve your processor to support more instructions.

Big Picture and Major Components

For Project06 we are building an ARM processor using a single-cycle micro-architecture. In a single-cycle processor, we execute a single instruction completely on each clock cycle. Our processor will need several major components in order to support the ARM subset needed by our Project03 test programs. See the following figure.



The PC register will hold the address of the instruction we are currently executing. The Instruction Memory holds the machine code representation of an ARM assembly program. We will generate the machine code directly from the object files produced by the ARM assembler on the Raspberry Pi. The Register File is a component that supports register reading and writing specific to ARM instructions. We will also expose the internal register values as outputs of the Register File so that we can see the register state in our top-level (main) processor circuit. The ALU will support the arithmetic operations needed to implement each instruction: addition, subtraction, and multiplication for data processing instructions, addition for branch target calculations, and addition and subtraction for memory address calculations for ldr and str. The Extender will support 8-bit and 12-bit zero extension to 32-bit values as well as 24-bit sign extension. The Extender is needed to properly extract immediate values from data processing, memory, and branch instructions. Finally, Data Memory will be implemented as a Digital RAM component with separated ports. Data Memory is where our stack will be located, and we will use the stack to allocate arrays for programs such as sum_array() and find_max().

Note in the diagram above I have not shown any of the data path, the control path, or even the control unit. I will discuss adding these elements below.

Processor Development Strategy

Here is a development strategy that I suggest you use for implementing your processor:

  1. Implement and test the major components shown above and described in more detail below.
  2. Create a main processor circuit, wire up the inputs to all the components, and create probes using tunnels to view the processor state (register values and the CPSR values N, Z, C, V). It is also useful to use splitters to see all of the fields of the different instruction types as you are developing your processor.
  3. Start with a simple test program. I provide first_a.s in the inclass repo to get you started.
  4. Load the ROM with this program and get the PC to update on each clock cycle with PC = PC + 4. In this way you can begin to see the different instructions come out of Instruction Memory (ROM).
  5. Wire up a minimal data path to support add and mov. Initially, use explicit input to the components to get these instructions working.
  6. Build an initial control unit to get add and mov working.
  7. Add to the data path to support the b (branch) instruction and modify the control unit to support branch.
  8. You should be able to run first_a.s now.
  9. Evolve the data path, control path, and control unit to support bl and bx. In this way you can support driving the test code from a main function.
  10. Evolve the data path, control path, and control unit to support mul. Now you can execute quadratic_a.s.
  11. Evolve the data path, control path, and control unit to support ldr and str. Now you can allocate space on the stack.
  12. Add support for remaining variations of data processing and memory instructions (like immediate versions).
  13. Add support for conditional execution of branches (based on the CPSR bits).
  14. You should now be able to run versions of sum_array, find_max, fib_iter, and fib_rec.
  15. Clean up your processor design and top-level structure.
  16. Do final testing of all the test programs.
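
Step 2 above suggests using splitters to expose the fields of the instruction word. As a minimal sketch of what those splitters compute, the following Python function extracts the standard ARM fields from a 32-bit word. The function name split_fields and the dictionary keys are my own labels for illustration, not names from the project.

```python
def split_fields(iw):
    """Split a 32-bit ARM instruction word into the fields a splitter would expose.

    The key names are hypothetical labels chosen for this sketch.
    """
    return {
        "cond":   (iw >> 28) & 0xF,   # bits 31:28, condition code
        "op":     (iw >> 26) & 0x3,   # bits 27:26, instruction class
        "i":      (iw >> 25) & 0x1,   # bit 25, immediate bit for data processing
        "opcode": (iw >> 21) & 0xF,   # bits 24:21, data processing opcode
        "s":      (iw >> 20) & 0x1,   # bit 20, set-flags bit
        "rn":     (iw >> 16) & 0xF,   # bits 19:16, first operand register
        "rd":     (iw >> 12) & 0xF,   # bits 15:12, destination register
        "rm":     iw & 0xF,           # bits 3:0, second operand register
        "imm8":   iw & 0xFF,          # bits 7:0, data processing immediate
        "imm12":  iw & 0xFFF,         # bits 11:0, memory offset immediate
        "imm24":  iw & 0xFFFFFF,      # bits 23:0, branch word offset
    }


fields = split_fields(0xE0802001)  # machine code for: add r2, r0, r1
print(fields["opcode"], fields["rn"], fields["rd"], fields["rm"])  # prints: 4 0 2 1
```

In the circuit, each of these entries corresponds to one splitter output that you can attach a probe or tunnel to.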

The Major Processor Components

In this section we will look at all of the major processor components and how they should work. Your job is to either build or configure the components as necessary.

PC

The PC is simply a Digital register component. You need to configure the PC register to hold a 32-bit value. You will connect the top-level clock component to the clock input of the PC. For our single-cycle processor, the en (enable) input will always be high (set to 1) because we update the PC on every clock cycle. The D (input) will ultimately come from a few different sources. Normally, for non-branch instructions, we will update the PC with PC + 4 so that we can execute the next instruction in memory. However, later you will add support to update the PC using the branch target address from a b (branch) or bl (branch and link) instruction. You will also add support to update the PC from a register value using the bx (branch and exchange) instruction. The Q (output) of the PC will be used to address the Instruction Memory ROM to retrieve the instruction word at the PC address.
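
The three sources for the PC input can be modeled as a small selection function. This is only a sketch of the behavior: the parameter names take_branch and take_bx are hypothetical stand-ins for whatever control lines your design ends up using, and the priority between them is an assumption (only one should be set for any given instruction).

```python
MASK32 = 0xFFFFFFFF  # keep values in 32 bits, like the hardware register


def next_pc(pc, branch_target, reg_value, take_branch=False, take_bx=False):
    """Return the value loaded into the PC on the next clock edge.

    take_branch and take_bx are hypothetical control inputs for this sketch.
    """
    if take_bx:
        return reg_value & MASK32      # bx: PC comes from a register
    if take_branch:
        return branch_target & MASK32  # b / bl: PC comes from the branch target
    return (pc + 4) & MASK32           # default: next sequential instruction


print(next_pc(0, 0, 0))                       # prints: 4
print(next_pc(12, 12, 0, take_branch=True))   # prints: 12
```

In the circuit this is a MUX (or a pair of MUXes) in front of the D input of the PC register.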

Instruction Memory

Initially the Instruction Memory will be a Digital ROM component. We will configure the ROM with 32 data bits (each instruction word is 32 bits) and enough address bits to support your largest test program. For example, if you set the address bits to 5, then you can store a program that has up to 2^5 = 32 instructions. Similarly, if you set the address bits to 6, then you can store a program that has up to 2^6 = 64 instructions. Note that the address input to the ROM component is a word address (not a byte address), so you need to convert the address from the PC register from a byte address to a word address using a splitter. You can use the same splitter to extract the number of address bits you specified when configuring the ROM.
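
The byte-to-word conversion amounts to dropping the two low bits of the PC and keeping as many of the next bits as the ROM has address bits. A minimal sketch, with rom_address as a name made up for this example:

```python
def rom_address(pc, addr_bits):
    """Convert a byte address in the PC to a ROM word address.

    Dropping the low 2 bits divides by 4; the mask keeps only addr_bits bits,
    which is what the splitter does when it selects bits [addr_bits+1 : 2].
    """
    return (pc >> 2) & ((1 << addr_bits) - 1)


for pc in (0, 4, 8, 12):
    print(pc, "->", rom_address(pc, addr_bits=5))
```

Note that with 5 address bits, a PC of 128 maps back to word 0, so a program larger than the ROM silently wraps around.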

Using objdump and the makerom.py script from the Project06 page you can generate a hex file that can be loaded properly into the Digital ROM component when you edit the contents of the ROM. Here is the process.
  1. Create an assembly program (.s) with the code you want your processor to execute. Note that you DO NOT want to use the .global, .func, or .endfunc directives in this code.
  2. Assemble the code: as -o foo_a.o foo_a.s
  3. Generate the hex file: objdump -d foo_a.o | python makerom.py > foo_a.hex
  4. Now you can load foo_a.hex into the instruction ROM.
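
To make the pipeline in step 3 less of a black box, here is a rough sketch of the kind of filtering such a script has to do: pick the 32-bit instruction words out of the objdump -d listing. This is NOT the makerom.py from the Project06 page. The function name, the regular expression, and the sample listing are my own assumptions about the typical objdump layout, and the sketch stops short of writing the hex file format that Digital expects.

```python
import re

# Assumed shape of a disassembly line: "   4:  e3a01002   mov r1, #2"
# (an address, a colon, then the 8-hex-digit instruction word).
WORD_LINE = re.compile(r"^\s*[0-9a-f]+:\s+([0-9a-f]{8})\s")


def words_from_objdump(text):
    """Return the instruction words found in objdump -d output, in order."""
    words = []
    for line in text.splitlines():
        match = WORD_LINE.match(line)
        if match:
            words.append(int(match.group(1), 16))
    return words


# Hypothetical sample listing for first_a.s, written by hand for this sketch.
SAMPLE = """
foo_a.o:     file format elf32-littlearm

Disassembly of section .text:

00000000 <first>:
   0:   e3a00001    mov r0, #1
   4:   e3a01002    mov r1, #2
   8:   e0802001    add r2, r0, r1

0000000c <end>:
   c:   eafffffe    b   c <end>
"""

for word in words_from_objdump(SAMPLE):
    print(f"{word:08x}")
```

The printed words are the values you should see coming out of the ROM as the PC steps through the program.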
Later you will want to create a new Instruction Memory component that contains several ROM files, one for each test program you plan on running on your processor. In this way you do not need to manually load each test program. Instead you will have a program number input that selects which program you want to run. This can be implemented with 2 or more ROM components and a MUX (multiplexor). The input address will be connected to the address inputs of each of the ROMs. The data output of each ROM will be connected to a MUX that selects one of the instruction words depending on the program number input.

Register File

The Register File supports reading up to two register values and writing to a single register in a single clock cycle.


Here are the inputs and their bit widths (sizes) and usage:
  • ReadReg0 (4 bits) selects the register value to output on RD0.
  • ReadReg1 (4 bits) selects the register value to output on RD1
  • WriteReg (4 bits) selects the destination register to update.
  • WriteEn (1 bit) determines if we write to the WriteReg register on current clock cycle.
  • WriteData (32 bits) the data value to write to the destination register selected by WriteReg.
  • PC (32 bits) the value of PC to be passed from the PC register so that the PC value (register 15) can be selected on RD0 or RD1 like all the other registers. Note you will want to input PC + 8 into the PC input as PC + 8 will be expected when computing the branch target address for the branch instruction.
  • CLK (1 bit) the clock input from the top-level clock component.
Here are the outputs and their bit widths and usage:
  • RD0 (32 bits) the data from register number ReadReg0
  • RD1 (32 bits) the data from register number ReadReg1
  • r0-r15 (32 bits) outputs for the 16 registers (including the PC, which is passed through). These allow you to set up probes on the top-level processor circuit to see the state of all the register values.
Note, as described above, you will pass in PC + 8 into the PC input on the Register File. However, you will likely also want to see the current PC value, so you will want two probes for the PC (the current PC and PC + 8 from the Register File).

You can follow the implementation of the Register File as discussed in class and found in the notes. You can use 15 32-bit registers, two 16to1 32-bit MUXes (one for RD0 and one for RD1), and a 4to16 Decoder with Enable for selecting at most one of the registers for writing.
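
The behavior of the Register File described above can be captured in a short software model. This is a sketch under my own assumptions: the class and method names are invented for illustration, reads are treated as combinational, and a write to register 15 is simply ignored because there is no stored register behind it.

```python
class RegisterFile:
    """Behavioral model: 15 stored registers plus a pass-through PC input."""

    def __init__(self):
        self.regs = [0] * 15  # r0-r14; r15 comes from the PC input instead

    def read(self, read_reg0, read_reg1, pc_in):
        """Combinational read: the two 16-to-1 MUXes selecting RD0 and RD1."""
        def select(n):
            return pc_in if n == 15 else self.regs[n]
        return select(read_reg0), select(read_reg1)

    def clock(self, write_reg, write_en, write_data):
        """Clock edge: the decoder enables at most one register for writing."""
        if write_en and write_reg != 15:
            self.regs[write_reg] = write_data & 0xFFFFFFFF


rf = RegisterFile()
rf.clock(write_reg=0, write_en=1, write_data=1)  # mov r0, #1
rf.clock(write_reg=1, write_en=1, write_data=2)  # mov r1, #2
print(rf.read(0, 1, pc_in=16))   # prints: (1, 2)
print(rf.read(15, 0, pc_in=16))  # prints: (16, 1)
```

Reading register 15 returns whatever is wired to the PC input, which is why the guide has you feed PC + 8 into it.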

Extender

The Extender supports two forms of zero extension and one form of sign extension for extracting immediate values from the instruction word. This is needed to extract the 8-bit immediate from the data processing instructions, the 12-bit immediate from the memory instructions, and the 24-bit immediate from the branch instructions.


Inputs:

  • iw (32 bits) the current instruction word.
  • EXT (2 bits) the extender selector.
Outputs:
  • extimm (32 bits) the extended value.
The EXT input selects one of the following forms of extension:
  • 00 : 8-bit zero extension. Concatenate the lower 8-bits from the iw with 24 bits of 0 to output a 32-bit value.
  • 01 : 12-bit zero extension. Concatenate the lower 12-bits from the iw with 20 bits of 0 to output a 32-bit value.
  • 10 : 24-bit sign extension to a 32 bit value.
Note that in order to support proper branch target address calculations, in addition to sign extending the 24-bit immediate, you will also need to multiply this value by 4 to convert the word offset into a byte offset. You can either use the Digital Multiply component or use the Digital Barrel shifter to left shift by 2 (which is equivalent to multiplying by 4). You can put this multiplication step in the Extender or outside the Extender.
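
As a sketch of the three extension modes and the branch target arithmetic, the following Python model may help when checking your circuit by hand. The function names are my own, and here the multiply-by-4 step is placed outside the Extender, which is one of the two placements the guide allows.

```python
MASK32 = 0xFFFFFFFF


def extend(iw, ext):
    """Model of the Extender: ext selects which immediate to extract."""
    if ext == 0b00:
        return iw & 0xFF            # 8-bit zero extension
    if ext == 0b01:
        return iw & 0xFFF           # 12-bit zero extension
    if ext == 0b10:
        imm24 = iw & 0xFFFFFF
        if imm24 & 0x800000:        # sign bit of the 24-bit value is set
            imm24 |= 0xFF000000     # copy it into the upper 8 bits
        return imm24
    raise ValueError("EXT value 11 is unused in this design")


def branch_target(pc, iw):
    """Branch target = (PC + 8) + (sign-extended imm24 * 4), wrapped to 32 bits."""
    offset = (extend(iw, 0b10) << 2) & MASK32
    return (pc + 8 + offset) & MASK32


# "b end" from first_a.s sits at address 12 and encodes as 0xEAFFFFFE.
print(hex(extend(0xEAFFFFFE, 0b10)))   # prints: 0xfffffffe
print(branch_target(12, 0xEAFFFFFE))   # prints: 12
```

The second line shows why the infinite loop works: the offset of -2 words becomes -8 bytes, which exactly cancels the PC + 8 bias.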

ALU - Arithmetic Logic Unit

The ALU supports the major types of mathematical computation needed by different instructions. For example, we need addition, subtraction, and multiplication for the data processing instructions. We need addition and subtraction to compute the target address for the memory instructions. We need addition to compute the branch target address for branch instructions.


Note that the ALU is purely combinational. It holds no state and therefore does not need a clock input.

Inputs:
  • A (32-bits) the first ALU operand.
  • B (32-bits) the second ALU operand.
  • ALUop (2-bits) selects a specific operation.
Outputs:
  • R (32-bits) the ALU result.
  • NZCV (4-bits) the CPSR values computed for the CMP instruction. This 4 bit output will be wired to the control unit.
  • N (1-bit) Negative used for display on the top-level.
  • Z (1-bit) Zero used for display on the top-level.
  • C (1-bit) Carry used for display on the top-level.
  • V (1-bit) Overflow used for display on the top-level.
Note you could also split the 4-bit NZCV value on the top level and eliminate each of the individual 1-bit values.

The ALUop input is defined as follows:
  • 00 : addition
  • 01 : subtraction
  • 10 : multiplication
  • 11 : mov
Note, it occurred to me after last week that it may be easier to support mov by using the unused ALUop value 11 to simply pass the B input value to the R output, thus simplifying the top-level processor circuit. Please see the notes from class on how to implement the ALU and how to properly compute the N, Z, C, and V values.

If you need your ALU to support more operations, e.g., mvn (move not, which writes the bitwise complement of the operand), then you will need to increase the number of bits for the ALUop and add a new code for each new operation. This will impact your control unit because instead of generating a 2-bit ALUop output, it will now need to generate a 3-bit ALUop output.
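
Here is a behavioral sketch of the 2-bit ALU that you can use to check expected results and flags. It follows the usual ARM conventions for the flags (for subtraction, C is the carry out of A + NOT B + 1, so C = 1 means no borrow). Leaving C and V at 0 for multiplication and mov is my own simplification, so defer to the class notes where they differ.

```python
MASK32 = 0xFFFFFFFF


def alu(a, b, aluop):
    """Return (R, NZCV) for the 2-bit ALUop encoding used in this guide."""
    a &= MASK32
    b &= MASK32
    c = v = 0
    if aluop == 0b00:                            # addition
        full = a + b
        r = full & MASK32
        c = full >> 32                           # carry out of bit 31
        v = ((~(a ^ b)) & (a ^ r)) >> 31 & 1     # same-sign inputs, different-sign result
    elif aluop == 0b01:                          # subtraction as A + NOT B + 1
        full = a + ((~b) & MASK32) + 1
        r = full & MASK32
        c = full >> 32                           # 1 means no borrow
        v = ((a ^ b) & (a ^ r)) >> 31 & 1        # different-sign inputs, result sign flipped
    elif aluop == 0b10:                          # multiplication (low 32 bits)
        r = (a * b) & MASK32
    elif aluop == 0b11:                          # mov: pass B through
        r = b
    else:
        raise ValueError("ALUop must be a 2-bit value")
    n = r >> 31
    z = int(r == 0)
    return r, (n << 3) | (z << 2) | (c << 1) | v


print(alu(1, 2, 0b00))   # prints: (3, 0)
print(alu(5, 5, 0b01))   # prints: (0, 6)
```

The second call is what cmp r0, r1 computes when both registers hold 5: the result is 0, so Z = 1, and there is no borrow, so C = 1, giving NZCV = 0b0110.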

Data Memory

We will simply use a Digital RAM component for Data Memory.

Note we will use the RAM with separate ports for simplicity. This means that we have a dedicated input port for the data input (str) and a dedicated output port (ldr). We will configure the RAM to use 32 data bits, which means we can only read and write words (32-bit values). You can configure the address bits to match the amount of memory you will need. Since we will assume the stack lives in Data Memory, the size of the RAM will determine how much stack space can be used. This will be important for fib_rec(), which can use a lot of stack space. You will have to experiment to figure out how much you need.

Inputs:
  • A (size of address bits) This is the input address, a word address.
  • Din (32-bits) the value to be written to memory at the address A.
  • str (1-bit) set to 1 if we are writing to memory.
  • C (1-bit) clock input
  • ld (1-bit) set to 1 if we are reading from memory.
Outputs:
  • D (32-bits) the data value read from memory at address A.
Note that just like the Instruction Memory ROM, we are only supporting word addresses, which means that after computing a target memory address for ldr and str, we need to convert this byte address into a word address before sending it to the RAM component. In addition, we need to ensure that the width of the target address in bits matches the width of the A input.
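
The address handling for Data Memory can be sketched the same way as for the ROM. The class below is an illustrative model only: the names are invented, and it treats a store as happening immediately rather than on a clock edge.

```python
class DataMemory:
    """Word-addressed RAM model with 2**addr_bits words of 32 bits each."""

    def __init__(self, addr_bits):
        self.addr_bits = addr_bits
        self.words = [0] * (1 << addr_bits)

    def word_address(self, byte_addr):
        """Drop the low 2 bits, then trim to the width of the A input."""
        return (byte_addr >> 2) & ((1 << self.addr_bits) - 1)

    def store(self, byte_addr, value):
        """str: write a 32-bit word."""
        self.words[self.word_address(byte_addr)] = value & 0xFFFFFFFF

    def load(self, byte_addr):
        """ldr: read a 32-bit word."""
        return self.words[self.word_address(byte_addr)]


mem = DataMemory(addr_bits=8)      # 256 words = 1024 bytes of stack space
mem.store(0x3FC, 42)               # top word of the RAM
print(mem.word_address(0x3FC))     # prints: 255
print(mem.load(0x3FC))             # prints: 42
```

Because the address is trimmed to addr_bits, a stack that grows past the bottom of the RAM wraps around and overwrites the top, which is one way a too-small RAM shows up when running fib_rec().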

Initial Top-Level Setup

PC = PC + 4
Wire explicit inputs to all components
Add tunnels and probes to see register state and ALU NZCV bits.

Here is a picture of an initial top-level setup that includes the PC, Register File, and ALU. This version allows you to step through a program and see the instruction words as the PC increments. In addition, the explicit inputs let you simulate instruction execution by hand to get a feel for what the eventual control unit needs to do.


First Implementation: first_a.s

Now consider implementing a processor that can support the execution of first_a.s:

first:
    mov r0, #1
    mov r1, #2
    add r2, r0, r1
end:
    b end

Follow these steps:

  • Add data path for add instruction. Initially the processor can only execute add.
  • Add data path for mov instruction.
  • Add initial control unit to choose between add (register) and mov (immediate).
  • Now you can execute mov and add and do 1 + 2 = 3 (result in r2).
  • Add data path, control path, and modify control unit to support b (branch) instruction.
Here is a picture of a top-level implementation that can support the execution of first_a.s:


To develop a control unit you should create a table that enumerates the inputs from the instruction word and the output control lines. This way you can determine the appropriate input/output values, then translate the table into digital logic. I use Google Sheets to create a control unit table. Here is a version of the table that supports the instructions needed for first_a.s:
In this table, we recognize the data processing instruction with op2 = 0 and op1 = 0. The op0-i-dp bit is the Immediate bit for data processing, but is also used as the op0 bit for the branch instruction. Note the opcode (opc) for add is 0b0100 and the opcode for mov is 0b1101. The control outputs are:
  • RFW (Register File Write, 1 bit) : This is connected to the WriteEn line on the register file to indicate we want to write a value. For all data processing instructions with the exception of compare, we want to write the result to the destination register.
  • EXT (Extender Select, 2 bits) : This determines which extension value we want from the Extender: 00 is the 8-bit zero extended value, 01 is the 12-bit zero extended value, and 10 is the 24-bit signed extended immediate value.
  • ALU1src (1 bit) : This determines which value we are going to send to the second input of the ALU (B). It will either come from RD1 from the Register File (0) or it will come from the Extender (1).
  • ALUop (2 bits) : This determines the operation to be performed by the ALU:
    • 00 : addition
    • 01 : subtraction
    • 10 : multiplication
    • 11 : move
  • BR (Branch, 1 bit) : If this bit is set to 1, it means we are executing a branch instruction. This is used to tell the Register File to output the PC value on RD0 and to choose the next PC update to come from the ALU result, which is the branch target address.
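
One way to sanity-check a control unit table before building the logic is to write it as a function. The sketch below covers only add, mov, and b, using the control line names from the list above. The specific output values are my reading of the descriptions in this guide, not a copy of the table, so treat them as assumptions and compare them with your own table. In particular, driving ALU1src directly from the immediate bit is an assumption that also covers the immediate form of add and the register form of mov.

```python
def main_decoder(iw):
    """Map an instruction word to control outputs for add, mov, and b."""
    op = (iw >> 26) & 0x3        # op2, op1 in the table
    i_bit = (iw >> 25) & 0x1     # op0-i-dp in the table
    opcode = (iw >> 21) & 0xF    # opc in the table
    ctl = {"RFW": 0, "EXT": 0b00, "ALU1src": 0, "ALUop": 0b00, "BR": 0}

    if op == 0b00:                       # data processing
        ctl["RFW"] = 1                   # write the result to Rd
        ctl["ALU1src"] = i_bit           # 1 selects the 8-bit immediate
        if opcode == 0b0100:             # add
            ctl["ALUop"] = 0b00
        elif opcode == 0b1101:           # mov
            ctl["ALUop"] = 0b11
        else:
            raise ValueError("opcode not supported in this sketch")
    elif op == 0b10 and i_bit == 1:      # b (bits 27:25 = 101)
        ctl.update(EXT=0b10, ALU1src=1, ALUop=0b00, BR=1)
    else:
        raise ValueError("instruction type not supported in this sketch")
    return ctl


print(main_decoder(0xE3A00001))  # mov r0, #1
print(main_decoder(0xE0802001))  # add r2, r0, r1
print(main_decoder(0xEAFFFFFE))  # b end
```

Each if branch corresponds to one instruction-identification signal in the logic, and each dictionary key corresponds to one OR gate feeding a control output.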
Once you have your control unit table and any updates to the data path and control path in your top-level processor circuit you can build the control unit. Here is a picture of the digital logic implementation of the control unit table above:




The goal is to create input lines for the inputs specified in the control unit table. You can create vertical wires connected to tunnels from the instruction word. From these wires you can build logic that identifies each instruction type. For each identified instruction type, you wire the identification signal to each output that should be set to 1 for that instruction. We need to route these lines to the outputs via OR gates because multiple instructions will need to set the same control line to 1. Initially, if a control line is always 0, you can just route the constant value 0 into the OR gates for each output. Eventually the zero input values will go away. Note that we can use the Digital comparator to check the opcode for each data processing instruction. When you add more variations of the data processing instructions you can reuse this comparator.


Second Implementation: first_main_a.s

Once you get the first version of the processor above working you can move on to a version that supports bl and bx. Here is a variation of first_a.s called first_main_a.s that adds bl and bx:

main:
    mov r0, #1
    mov r1, #2
    bl first
end:
    b end

first:
    add r0, r0, r1
    bx lr

You will need to evolve your data path, control path, and control unit to support these additional instructions.

Here is a picture of the control unit table for bl and bx:

This table also adds rows for the immediate variation of add and the register variation of mov. Notice the additional control lines for BL and BX.

Here is a picture of the control unit logic with support for bl and bx as determined from the table:


In this version of the control unit, I am using tunnels to connect the instruction identification logic to the OR gates for each of the control line outputs. This makes the design much cleaner as we add support for more instructions. Also note the use of a comparator to check for the bx instruction directly.

Conditional Execution

We need to add support for conditional execution in order to support beq, bne, bge, and others. We can extend the control unit by encapsulating the control logic above into a component called the Main Decoder. We can then create a new control unit that uses the Main Decoder and adds conditional execution support. We will need a 4-bit register for storing the CPSR bits when we execute the cmp instruction. Recall that when we execute cmp, we save the NZCV bits from the ALU into the CPSR register. We can then check the condition code of each instruction to see if we need to execute it conditionally. For our programs we only need to support conditional execution for b (branch). Here is an incomplete version of a new control unit with partial support for conditional execution:


To support conditional execution you will need a new control output from the Main Decoder that indicates whether we are executing the cmp instruction. If so, the en input to the 4-bit CPSR register needs to be set to 1. This provides the ability to store the NZCV bits from the ALU into the CPSR. Next, we need to look at the 4-bit condition code from the instruction word (iw) to see if we need to conditionally execute the instruction. If the condition code is AL (0b1110), then we always execute. However, if the condition code is EQ, then we check to see if the Z bit from the CPSR is set to 1. The result of the condition checking goes into an OR gate which then controls the BR control line. That is, if the condition code is AL then we take the branch (BR = 1). If the condition code is EQ and the Z bit from the CPSR is set to 1, then we take the branch (BR = 1). If the code is EQ and the Z bit is 0, then we don't take the branch (BR = 0). You can add support for additional condition codes in a similar fashion. Note that if you want to support conditional execution of any instruction type (not just b), then all you need to do is AND the condition check line with the state enable control lines like RFW (and eventually a Memory Write control line). The idea is that in order to ignore an instruction we simply don't update the state that would be changed by the instruction.
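
The condition check can be sketched as a function of the condition code and the stored CPSR bits. The encodings below are the standard ARM ones (EQ = 0000, NE = 0001, GE = 1010, AL = 1110), and the CPSR is assumed to be packed in N, Z, C, V order from bit 3 down to bit 0. The function names, and the choice to AND the decoder's branch signal with the condition result, are my own framing of the logic described above.

```python
EQ, NE, GE, AL = 0b0000, 0b0001, 0b1010, 0b1110


def cond_pass(cond, cpsr):
    """Return True if the condition code is satisfied by the stored NZCV bits."""
    n = (cpsr >> 3) & 1
    z = (cpsr >> 2) & 1
    v = cpsr & 1
    if cond == AL:
        return True          # always execute
    if cond == EQ:
        return z == 1        # equal: Z set
    if cond == NE:
        return z == 0        # not equal: Z clear
    if cond == GE:
        return n == v        # signed greater than or equal
    raise ValueError("condition code not supported in this sketch")


def branch_taken(iw, br_from_decoder, cpsr):
    """BR is asserted only for a branch whose condition passes."""
    cond = (iw >> 28) & 0xF
    return bool(br_from_decoder) and cond_pass(cond, cpsr)


# beq with an arbitrary zero offset encodes as 0x0A000000.
print(branch_taken(0x0A000000, 1, 0b0110))  # prints: True
print(branch_taken(0x0A000000, 1, 0b0000))  # prints: False
```

The same cond_pass result is what you would AND with RFW (and later a Memory Write line) if you decide to support conditional execution for every instruction type.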







