\documentclass[11pt, a4paper]{article} \usepackage[utf8]{inputenc} \usepackage[T1]{fontenc} \usepackage[margin=1.0in]{geometry} \usepackage[british]{babel} \usepackage{indentfirst} \usepackage{array,booktabs,longtable} \usepackage{multirow} \usepackage{listings} \usepackage{comment} \newcommand{\prog}[1]{{\tt#1}} \newcommand{\underscore}{$\_\,$} \newcommand{\vivace}{\textsc{Vivace}} \begin{document} \title{Conception and realization of the \vivace{} architecture \\ \normalsize{\textsc{Projet de Système digital}}} \author{A. Auvolat \and É. Enguehard \and J. Laurent} \maketitle The \vivace{} architecture is a minimalistic 16 bits RISC microprocessor architecture, largely inspired by the MIPS microprocessor. The principal characteristics of the architecture are: \begin{itemize} \item \textit{8 general-purpose registers}, which can hold 16 bits integers: \prog{Z, A, B, C, D, E, F, G} \item \textit{16 bit memory addressing}, enabling the CPU to use up to $64kb$ of memory. \end{itemize} In order to implement and run the architecture, the following programs have been written: \begin{itemize} \item \textit{Netlist simulator} and \textit{netlist optimizer} \item An \textit{OCaml library} for generating netlists from Caml code \item The code for the \textit{CPU implementation} (written in Caml) \item A \textit{monitor} which is used to interact dynamically with the netlist simulator \item An \textit{assembler} which can be used to produce the ROM files run by the CPU. \end{itemize} The reader keen on learning valuable and useful information should skip the next section and jump to section \ref{sec:useful}. He who does not strive for usefulness may proceed. \section{The name \vivace{}} \vivace{} is a recursive acronym standing for “\emph{Virtually Infallible \vivace Automated Computing Environment}”. We do not think anyone would dispute that \vivace{} is an automated computing environment, and shall thus discuss the rest of the allegations that are made in this name. \subsection{Why is \vivace{} virtually infallible ?} \vivace{} bears a great many similarities to the pope, in that, among other things, it is virtually infallible the sole and only reason that we, its creators, are able to claim as loudly as needs be that it is so. \subsection{Why “\vivace{}” ?} “\vivace{}”, as anyone knows, is an italian word meaning, surprisingly enough, “vivacious”. Outside of its homeland, the word is mostly known for its place of honour in the musical glossary. Its inscription at the beginning of a musical piece means the composer of said piece intends it to be played in a lively tone, yet the interpret should not strive for fastness and virtuosity, as he should if the inscription were “Allegro” or “Presto”. Its name, thus, suits perfectly \vivace{}; it is certainly as fast as a software-emulated microprocessor can be, yet, as it is the sad fate of any software-emulated microprocessor, it is extremely slow. One may ask: what does \vivace{} has to do with music? In fact, the \vivace{} CPU has eight registers, seven of which are referred to with the letters \prog{A} through \prog{G}. In English musical notation, those letters happen to be the names of the seven notes of the usual scale.\footnote{In German musical notation, the letters \prog{A} through \prog{G} are the names of the same seven notes, but in F major instead of C major. Thus the letter \prog{B} represents an English B-flat, while an English B is represented by the letter H.} This relation between \vivace{} and music can be used to translate \vivace{} programs into musical pieces, which, while extremely interesting from an intellectual point of view, is remarkably useless. Consider, for instance, the following program, that uses only elementary instructions (see section \ref{sec:assembly} for a description of the VIVACE assembly language): \begin{verbatim} .text # reads a nonnegative rational number and outputs its integer part and # whether it is an integer or not liuz A 0x41 # input address lbr B 0(A) # numerator lra E 2 lbr C 0(A) jer E C Z # denominator must be non-zero divu D B C se G E Z mulu F D C lil A 0x02 # output address sb F 0(A) sb G 0(A) \end{verbatim} Mapping cycles to notes' length and write access to registers to the pitch, we get the following piece, for the input \prog{31 10}: \lilypond[fragment,notime,quote]{ \set Timing.defaultBarType = "" a'4 b'4. e'4 c'4. s \appoggiatura e'8 d'\breve g'4 f'2 a'4 s2} As should be expected, it sounds awful. \section{How to run the \vivace{} cpu} \label{sec:useful} \subsection{Preparation} All the tools described in the introduction must first be compiled: \begin{verbatim} $ cd csim; make; cd .. $ cd sched; make; cd .. $ cd monitor; make; cd .. $ cd asm; make; cd .. \end{verbatim} To run the \vivace{} CPU, type the following: \begin{verbatim} $ cd cpu; make \end{verbatim} \subsection{Monitor commands} You are now running the \vivace{} CPU. The monitor accepts a few commands to control the simulation. First, you must configure the monitor to communicate with the CPU. Type: \begin{verbatim} t 0 s 1 19 18 d7 20 21 22 23 24 25 26 27 \end{verbatim} The first command sets up the tick input (a tick is sent once every second on this input by the monitor). The second command sets up the serial input/output. The third command sets up the 7-segment display (8 digits displayed). Now, use the following commands to control the simulation: \begin{itemize} \item \prog{a} run the simulation at full speed \item \prog{m} run the simulation step by step (enter an empty command to run a step) \item \prog{f } run the simulation at fixed frequency (frequency is dynamically ajusted so this is not very accurate) \item \prog{q} exit simulation \end{itemize} The CPU recieves commands on the serial input. To send a command to the CPU, use the following syntax: \begin{verbatim} : \end{verbatim} For instance: \begin{verbatim} :Y2014 \end{verbatim} These commands are essentially used to set one of the six variables \prog{YMDhms} ; the syntax is similar to the example command given above. An empty CPU command tells the CPU to just tell us what time and what day it is. \section{Program details} \subsection{Generating netlists from Caml code} We have developped a library that enables us to easily generate netlists from Caml code. The Caml code we write has the same abstraction level that MiniJazz has, but it is more comfortable to write circuits like this than with MiniJazz code. The library functions are defined in \prog{cpu/netlist\_gen.mli}. Basically, we have created functions that build the graph of logical operations. The abstract type \prog{t} is actually a closure to a function that adds the required equation to a netlist program being built, therefore the generation of a netlist consists in two steps: the generation of a closure graph that describes the graph of logical operations, and the execution of these closures on a program which, at the beginning, has only the circuit inputs. The equations are progressively added to the program when the closures are called. The \vivace{} CPU has been entirely realized using this library. \subsection{The \vivace{} CPU} \subsubsection{Control structure} The CPU is able to execute instructions that need several cycles to run. The two first cycles of an instruction's execution are used to load that instruction (16 bits have to be read, ie two bytes). Most instructions finish their execution on the second cycle, but some executions need more cycles to run: \begin{itemize} \item Load and store instructions need one or two extra cycles \item The multiplication operation needs as many cycles as the position of the most-significant non-null bit in the second operand. \item The division always runs on 16 cycles. \end{itemize} The execution of instructions on several cycles is implemented using a ``control bit'' that cycles through several steps: load instruction, various steps of instruction execution. A few of these step control bits appear in the simulator, as CPU outputs: \begin{itemize} \item \prog{read\_ilow}, \prog{read\_ihi} CPU is reading low byte/high byte of the instruction \item \prog{ex\_instr} CPU begins execution of the instruction \item \prog{ex\_finish} CPU finishes execution of the instruction (modified registers may only appear in the monitor at the next step) \end{itemize} \subsubsection{ROM, RAM and MMIO} The CPU has uniform acces to a 64kb address space, which contains the ROM (\prog{0x0000-0x3FFF}), MMIO (\prog{0x4000-0x7FFF}) and the RAM (\prog{0x8000-0xFFFF}). The \prog{cpu\_ram} (\prog{cpu.ml}) subcircuit is basically a bunch of multiplexers that redirect reads and writes to the correct places. The serial input/output is implemented using one input and two outputs : \begin{itemize} \item Input \prog{ser\_in} (8 bits) : when this input is non-null, the character entered is buffered by the CPU. This buffer can be read by reading MMIO byte at address \prog{0x4100}. The buffer is reset to zero on read. \item Output \prog{ser\_in\_busy} (1 bit) : signals when the input buffer is nonzero (ie a character is pending, waiting for the CPU to read and handle it). \item Output \prog{ser\_out} (8 bits) : when non-null, the CPU is sending a character to the serial output. This output can be written by writing MMIO byte at address \prog{0x4102}. \end{itemize} The clock is also handled by MMIO : the CPU recieves a tick every second on input \prog{tick}. When a tick is recieved, the tick buffer is incremented by one. This tick buffer can be read by reading MMIO word at address \prog{0x4000}. When the word is read, the buffer is reset to zero. The 7-segment display is also handled by MMIO : the 8 digits can be modified by writing a byte to MMIO addresses \prog{0x4200} to \prog{4207}. \subsubsection{The ALU} The \emph{Arithmetic and Logic Unit} implements the following features : \begin{itemize} \item A \textbf{parallel addition circuit} consisting in a linear chain of full--adders. The same circuit is used for doing substractions. There is no distinction between \textit{signed} and \textit{unsigned} addition. \item A \textbf{serial unsigned multiplication} circuit that needs as many cycles as the position of the most-significant non-null bit in the second operand. \item A \textbf{serial unsigned division circuit} that computes one digit of the quotient and one digit of the remainder per cycle. It relies on the long division algorithm every child has learnt in school. The algorithm and the source code are given in example in Figure~\ref{divalg} and Figure~\ref{divcode}. \end{itemize} We first tried to keep all the instructions working on one cycle. This is feasible concerning the multiplication, by using $16$ addition circuits working in parallel. However, the generated overhead in the netlist is of the same order of magnitude as the total size of the CPU without this feature. Moreover, it appeared it was neither natural nor efficient to write a parallelized version of the division circuit. Therefore, we changed the ALU in order to allow a few instructions executing on several cycles, by adding two bus \prog{start\underscore{}signal} and \prog{work\underscore{}remains}. See Figure~\ref{divcode} for an example. Arithmetic overflows are not handled, and the behaviour of the ALU is unspecified in the case of a division by $0$. Signed multiplication and division are not yet implemented, but it would be easy to use some multiplexers and negation circuits to reuse the unsigned version of these circuits, computing the sign of the result independently. In regard to the simulator's implementation, there's no need to optimize the depth of the different circuits but the total number of gates. However, some functions that are used to generate circuits working on \emph{Redudant Binary Representation} were left in the source code. \begin{figure}[p] \textsc{Long-div} : computes the remainder \prog{R} and the quotient \prog{Q} of the division of \prog{N} by \prog{D}: \medskip \begin {lstlisting}[basicstyle=\ttfamily, frame=lines] Q := 0 R := 0 for i = n-1 to 0 do R := R << 1 R(0) := N(i) if R >= D then R = R - D Q(i) := 1 end end \end{lstlisting} \caption{The binary long division algorithm} \label{divalg} \end{figure} \begin{figure}[p] \lstset{language=ML} \begin{footnotesize} \begin {lstlisting}[basicstyle=\ttfamily, frame=lines] let rec ndivu n a b start_signal = let next_busy, set_next_busy = loop 1 in let busy = start_signal ^| (reg 1 next_busy) in let dd, set_dd = loop n in let q, set_q = loop n in let r, set_r = loop n in let c, set_c = loop n in let c = set_c ( mux start_signal (shiftl1 n (reg n c)) ((const "0") ++ (rep (n-1) (const "1"))) ) in let q = mux start_signal (reg n q) (zeroes n) in let r = mux start_signal (reg n r) (zeroes n) in let dd = set_dd (mux start_signal (shiftl1 n (reg n dd)) a) in let r = (dd ** (n-1)) ++ (r % (0, n-2)) in let rq = mux (ule_n n b r) (r ++ ((const "0") ++ (q % (0, n-2)))) ((nsubber n r b) ++ ((const "1") ++ (q % (0, n-2)))) in let r = set_r (rq % (0, n-1)) in let q = set_q (rq % (n, 2*n-1)) in let work_remains = nonnull n c in let finished = set_next_busy (busy ^& work_remains) ^. (not work_remains) ^& busy in dd ^. c ^. q, r, finished \end{lstlisting} \end{footnotesize} \caption{The unsigned division circuit} \label{divcode} \end{figure} \subsection{The \vivace{} assembly} \label{sec:assembly} The \vivace{} assembly language is mostly inspired from the MIPS assembly language. An assembly program is made of a \prog{.text} and an optional \prog{.data} segment, both of which may contain labels that behave exactly like in MIPS, except that they may not begin with a capital letter. End-of-line characters are used as delimiters. Comments begin with \prog{\#} and end with an end-of-line character. \subsubsection{The \prog{.data} segment} Due to limitations of the simulator that we did not consider were worth the trouble to remove, the \prog{.data} segment only contains uninitialized data. Its main use is to declare labels for use in the \prog{.text} segment. Instead of formally describing the straightforward syntax of the \prog{.data} segment, we will provide a simple example: \begin{verbatim} .data label1: word 3 # three 16-bit words label2: # label2 is label1 + 6 byte 1 # one byte \end{verbatim} \subsubsection{The \prog{.text} segment} The \prog{.text} segment contains both instructions and read-only data. The programmer has to ensure that the program counter never reaches the data. Read-only data is declared in the following manner: \begin{verbatim} # here be instructions # they'd better end with a jump data: word 0b101010 42 '*' # three words with the same value word -1 # 0xFFFF byte -1 255 # 0xFF 0xFF string1: ascii "hello world!\n" # a trailing null character is automatically added string2: ascii "well hello to you good sir!" strings_addr: word string1 string2 # this will make things easier \end{verbatim} Instructions have between 0 and 3 operands, which are to be separated by spaces. Depending on each instruction, operands may be integers, labels or registers. Registers are referenced either by their name (\prog{Z}, \prog{A} etc.) or by their number (\prog{\$0}, \prog{\$1} and so on). Here is a list of possible instruction formats: \begin{verbatim} # format R3: add A B C add A B -5 # signed, 5 bits # format R2: move A B # format I: incri A 2 # signed byte # format J: j -5 # signed, 11 bits j label # format R: jr RA # RA is G # format MEM: lw A 2(D) # signed, 5 bits lw A addr_label # format H: hlt \end{verbatim} The first register is usually the destination register. Like in MIPS, the \prog{sw} and \prog{sb} instructions are exceptions to this rule. Here is a list of all instructions supported by the \vivace{} assembly language. The column specifies wether the instruction is actually supported by the processor or it is translated to elementary instructions. Elementary instructions may still be translated to several instructions, for example if the operands are too big, or if integer operands are used in a command that only accepts registers. In that case, the \prog{E} register is used to store the operands.\footnote{While the assembler will process all listed instructions, some of them may not have been implemented in the CPU itself.} \begin{center} \begin{longtable}{lccl} \toprule Name & Elementary? & Format & Comments \\ \midrule \prog{add} & Yes & R3 & \\ \cmidrule{1-4} \prog{sub} & Yes & R3 & \\ \cmidrule{1-4} \prog{mul} & Yes & R3 & \\ \cmidrule{1-4} \prog{div} & Yes & R3 & \\ \cmidrule{1-4} \prog{addu} & Yes & R3 & Unsigned version of \prog{add} \\ \cmidrule{1-4} \prog{subu} & Yes & R3 & \\ \cmidrule{1-4} \prog{mulu} & Yes & R3 & \\ \cmidrule{1-4} \prog{divu} & Yes & R3 & \\ \cmidrule{1-4} \prog{and} & Yes & R3 & \\ \cmidrule{1-4} \prog{or} & Yes & R3 & \\ \cmidrule{1-4} \prog{xor} & Yes & R3 & \\ \cmidrule{1-4} \prog{nor} & Yes & R3 & \\ \cmidrule{1-4} \prog{lsl} & Yes & R3 & \\ \cmidrule{1-4} \prog{lsr} & Yes & R3 & \\ \cmidrule{1-4} \prog{asr} & Yes & R3 & \\ \cmidrule{1-4} \prog{se} & Yes & R3 & “Set if Equal” \\ \cmidrule{1-4} \prog{sne} & Yes & R3 & \\ \cmidrule{1-4} \prog{slt} & Yes & R3 & “Set if Lower Than” \\ \cmidrule{1-4} \prog{sle} & Yes & R3 & “Set if Lower or Equal” \\ \cmidrule{1-4} \prog{sltu} & Yes & R3 & Unsigned version of \prog{slt} \\ \cmidrule{1-4} \prog{sleu} & Yes & R3 & “Set if Lower Than” \\ \cmidrule{1-4} \prog{incri} & Yes & I & Allows for larger integers than \prog{add} \\ \cmidrule{1-4} \prog{shi} & Yes & I & Shifts first operand by second operand; mostly useless\\ \cmidrule{1-4} \prog{j} & Yes & J & If operand is an integer, performs a relative jump \\ \cmidrule{1-4} \prog{jal} & Yes & J & “Jump and link;” return address is stored in \prog{G}\\ \cmidrule{1-4} \prog{jr} & Yes & R & \\ \cmidrule{1-4} \prog{jalr} & Yes & R & \\ \cmidrule{1-4} \prog{jer} & Yes & R3 & “Jump if (R2 and R3 are) Equal to (address stored in) Register” \\ \cmidrule{1-4} \prog{jner} & Yes & R3 & \\ \cmidrule{1-4} \prog{jltru} & Yes & R3 & \\ \cmidrule{1-4} \prog{jleru} & Yes & R3 & \\ \cmidrule{1-4} \prog{lra} & Yes & J & Instead of performing jump, loads target address in register \footnote{Labels may be loaded with command \prog{li}.} \\ \cmidrule{1-4} \prog{lw} & Yes & MEM & \\ \cmidrule{1-4} \prog{sw} & Yes & MEM & \\ \cmidrule{1-4} \prog{lb} & Yes & MEM & \\ \cmidrule{1-4} \prog{sb} & Yes & MEM & \\ \cmidrule{1-4} \prog{lwr} & Yes & R3 & Looks up at address R2 + R3\\ \cmidrule{1-4} \prog{swr} & Yes & R3 & \\ \cmidrule{1-4} \prog{sbr} & Yes & R3 & \\ \cmidrule{1-4} \prog{lbr} & Yes & R3 & \\ \cmidrule{1-4} \prog{hlt} & Yes & H & Infinite loop \\ \cmidrule{1-4} \prog{li} & No & I & Translates to appropriate \prog{li*} commands \\ \cmidrule{1-4} \prog{lilz} & Yes & I & Loads lower byte, zeroes upper byte \\ \cmidrule{1-4} \prog{liuz} & Yes & I & Loads upper byte, zeroes lower byte \\ \cmidrule{1-4} \prog{lil} & Yes & I & Loads lower byte \\ \cmidrule{1-4} \prog{liu} & Yes & I & Loads upper byte \\ \cmidrule{1-4} \prog{push} & No & R & Stores contents of register on top of stack \\ \cmidrule{1-4} \prog{pop} & No & R & Loads contents of register from top of stack \\ \cmidrule{1-4} \prog{move} & No & R2 & \\ \cmidrule{1-4} \prog{not} & No & R2 & \\ \cmidrule{1-4} \prog{move} & No & R2 & \\ \cmidrule{1-4} \prog{jz} & No & R2 & “Jump if Zero“ \\ \cmidrule{1-4} \prog{jnz} & No & R2 & \\ \bottomrule \end{longtable} \end{center} In addition to this, labels \prog{\underscore clock}, \prog{\underscore output} and \prog{\underscore input} are mapped respectively to the clock counter, the serial output and the serial input. Here is an example function that outputs a string on the serial output: \begin{verbatim} # PROCEDURE: ser_out_str # ROLE: write null-terminated string to serial output # ARGUMENTS: address of string in register A ser_out_str: li C _output _ser_out_str_loop: lb B 0(A) jz B _ser_out_str_ret sb B 0(C) incri A 1 j _ser_out_str_loop _ser_out_str_ret: jr RA \end{verbatim} \subsubsection{The assembler} The assembler itself is written in OCaml in a quite straightforward manner, using the \prog{ocamllex} and \prog{menhir} tools, which allows for great flexibility. \subsection{The simulator and the monitor} The simulator is written in C for performance reasons. The monitor is a C program, using the curses library for output to the console. The simulator and the monitor communicate via Unix named pipes (FIFO's), which are created in the files \prog{/tmp/sim2mon} and \prog{/tmp/mon2sim}. The synchronization of the two programs has been somewhat problematic, due to incorrect use of \prog{scanf} making the programs hang. \subsection{The operating system} The operating system (source file : \prog{os.asm}) is a simple collection of assembly functions, that implement the following features : \begin{itemize} \item Serial input/output \item Writing of string and integers to serial output \item Date calculation \item Parsing of elementary command-line calls to set the clock variables \item Unit tests for addition, substraction, unsigned multiplication and usigned division. \end{itemize} \section{Results and benchmarking} We have managed to run the processor sucessfully, and to make it display the date and time as expected. With only basic ALU operations implemented (basically, unsigned division and comparison), the netlist generated for the CPU is 5572 lines long without simplifications, and 2560 lines long after simplifications (see report on the simulator for details about these simplification passes). The CPU runs at a reasonnable frequency of more than $10khz$, which enables it to interact with the user almost instantly. \end{document}