Wishbone II - Zero-Stall High-Speed FPGA Transactional Bus

Large FPGAs have opened a new era of embedded design opportunities. The integration of one or more CPUs alongside numerous and varied peripheral interfaces has driven the development of standard interconnect schemes, the so-called system-on-chip design. The Wishbone specifications, introduced by OpenCores and Silicore, are widely used as an open platform standard for the seamless integration of IP cores from various vendors. Here we present a successor to Wishbone, called Wishbone II, which introduces the new Transactional Bus concept. The goal of Wishbone II is to remain backward interoperable with Wishbone B.3 while raising system throughput up to the maximum, by allowing multiple transactions at the same time and reducing bus stalls down to zero cycles. In this way maximum throughput is achieved even when interfacing slow and high-latency peripherals.

Uros Platise, 22nd March 2008

Introduction

Companies such as Xilinx, Altera, and the increasingly strong Lattice provide large-scale Field Programmable Gate Arrays (FPGAs). With millions of gates, thousands of functional units, and large pin counts, these devices are an ideal replacement for ASICs in low- and medium-volume production. Single-chip solutions embedding multiple CPUs, memory controllers, communication interfaces, custom-specific functions such as complex math operations, and so forth, have forced designers to adopt a so-called standard system-on-chip interconnect platform.

Xilinx has introduced the On-Chip Peripheral Bus combined with the Processor Local Bus. Altera has introduced the Avalon Bus with a GUI front end called SOPC Builder, around which the widespread Nios system is built. In addition, Silicore together with OpenCores has released the Wishbone System-on-Chip (SoC) specifications [OpenCores], which today represent the most widespread SoC interface for IP exchange in open-source hardware and beyond. It is also the bus Lattice uses for its own Mico system as an answer to the Altera Nios system.

In general these interconnect architectures are single-transaction, master/slave oriented: a CPU requesting a word from a given address stalls itself, and the path (bus) to the destination, for as long as the word has not been received. Many bus cycles are lost this way, giving lower actual data throughput than the relatively high system bus frequency would suggest. Even with the fast burst reads and writes introduced by special signals, bus cycles are still lost until the first word is received, at the additional cost of duplicating the burst logic on both sides, source and destination. Bus stalling is more evident when accessing slower modules with greater latencies. In these cases system performance degrades dramatically; for example, a 100 MHz system may see its throughput fall as low as a few MB per second. This is why bus architectures employing new concepts were desperately needed.

In this paper Wishbone II is introduced. Based on the Wishbone B.3 specifications for backward compatibility, Wishbone II introduces the Transactional Bus concept, which allows multiple transactions to take place at the same time on several paths between CPUs and peripherals. The key advantages of Wishbone II are maximum throughput, decreased operating frequency, support for high-latency peripherals (either slow peripheral modules or new high-speed serial interfaces), and simple design with backward compatibility. Wishbone II may also be used only on the critical paths of an existing embedded design to overcome throughput-related problems in, for example, streaming applications: video, voice, telecommunications, display, and so forth.

The document is organized as follows: first the transactional bus concept is presented, followed by its physical mapping to the new Wishbone II. A performance evaluation comparing it to Wishbone B.3 and a conclusion follow.

Transactional Bus Concept

Base

Let us first present the transactional bus concept by defining the transaction bus vector V, which delivers a message from source S to target T. The transaction bus vector V defines a one-way path and consists of the following elements:

  • Source S
  • Target T
  • Operator O
  • Data D

The S and T are assigned unique identifiers and define the path. Sources are therefore generators of the transactional bus vectors and targets are receivers.

The operator O describes one or more operations to be executed along the path, or at the target T. Operations may also require supplemental data such as target address.

The data D represents the message to be delivered.

Transaction vectors V=V_0..V_N are placed onto the so-called Transactional Bus B, which transports vectors from sources to targets via a virtual switch according to S and T. Along the way, some specific operations set by O may already be executed by B itself. The bus B is also responsible for priorities and scheduling.

Once a source S places a vector V on bus B, the bus takes ownership and control of its delivery to the target T, releasing S. Each S may therefore issue another vector, and so forth, according to the capacity of the bus B.
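As a rough software model, the vector and bus described above might look as follows in Python; all names (Vector, TransactionalBus, place, deliver) are illustrative sketches, not part of the specification:

```python
from dataclasses import dataclass
from collections import deque
from typing import Any

@dataclass
class Vector:
    """One transaction bus vector V = (S, T, O, D)."""
    source: int       # S: unique source identifier
    target: int       # T: unique target identifier
    operator: str     # O: e.g. "SDR", "SDW", "BuL", "CuL"
    data: Any = None  # D: message payload (write data, target address, ...)

class TransactionalBus:
    """Bus B: accepts vectors from sources and owns their delivery."""
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.queue = deque()

    def place(self, v: Vector) -> bool:
        # A source is released as soon as the bus accepts the vector;
        # it may immediately issue another one, up to bus capacity.
        if len(self.queue) >= self.capacity:
            return False          # bus full: source must wait
        self.queue.append(v)
        return True

    def deliver(self, targets: dict):
        # Virtual switch: route each queued vector to its target T,
        # here in issue order, preserving dependent-vector sequencing.
        while self.queue:
            v = self.queue.popleft()
            targets[v.target].receive(v)
```

Note how place releases the source immediately: acceptance by the bus, not completion at the target, ends the source's involvement.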

There are two kinds of vectors V:

  • Independent
  • Dependent

In the second case the order is important, meaning that a set of vectors V=V_0,…,V_N issued by S must be delivered to the target T in the same sequence. This restriction concerns the implementation of the bus B and its virtual switch.

Four basic operations O are defined:

  • Single Data Read (SDR)
  • Single Data Write (SDW)
  • Bus (Un)Lock (BuL)
  • Cell (Un)Lock (CuL)

Single data read and write are executed at the target T, while cell and bus locking operations belong to the transactional bus domain B. Once the target T executes operation O from vector V, in the case of an SDR it returns the result as a new vector V’. The way it generates the new V’ may be custom specific, or it may simply take the original V and exchange its S and T elements.
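The simple exchange scheme mentioned above can be sketched in a few lines, assuming an illustrative (S, T, O, D) tuple layout:

```python
def respond_sdr(vector, read_data):
    """Form the response vector V' for an SDR by exchanging the S and T
    elements of the original vector V = (S, T, O, D).
    The tuple layout is illustrative, not part of the specification."""
    s, t, op, _addr = vector
    assert op == "SDR"
    # The original target becomes the response source, so the read
    # result travels back along the reverse path to the original source.
    return (t, s, "RESP", read_data)
```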

Bus Handshaking and Error Handling

The transactional bus concept does not provide a per-cycle error notification and status mechanism, as many vectors may be placed on the bus from various sources before they actually execute, or fail to execute, at their targets.

From this perspective the bus B is responsible to:

  • implement hand-shaking between every source and target
  • forward vectors internally from sources to targets
  • report errors through other instances, such as a virtual Error Notification Target (ENT)

In this way, sources simply fill the bus with as many vectors as the bus can take care of. In the case of an error, such as a timeout or an unavailable or wrong target, the bus generates and issues an error message whose target is the ENT and which carries the source S of the original message (employing the redirecting technique described below). The ENT may be implemented as part of an intelligent unit such as a CPU, to which all error messages, including those from other peripherals, may be gathered. According to S, several ENTs may be implemented to group the error messages by source.

Composite Operations

Redirecting

The source S may generate an SDR vector in which it specifies another source S’, to which the resulting message (data) is delivered by the target T.

Examples of such an operation are DMA transfers, where the generic DMA generator S only issues move requests while the actual data transfer takes place between T and S’. Another example is the Error Notification Target (ENT) described above.
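The redirected DMA read described above can be sketched as follows; the tuple layout and all function names are illustrative assumptions, not part of the specification:

```python
def redirected_sdr(recipient, target, address):
    """Build an SDR vector whose result the target delivers to another
    source S' (recipient) instead of back to the issuing DMA engine.
    Tuple layout: (source, target, operator, payload), illustrative."""
    # Carrying S' in the source field means that the target's response
    # vector, formed by exchanging source and target, lands at S'.
    return (recipient, target, "SDR", address)

def dma_move(src_target, dst_recipient, base, count):
    # The DMA generator only issues move requests; the data itself
    # flows directly from T to S' and never passes through the mover.
    return [redirected_sdr(dst_recipient, src_target, base + i)
            for i in range(count)]
```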

Burst Reads and Writes

Burst reads and burst writes are accomplished by the S issuing a stream of SDRs or SDWs placed one after another. The target T therefore needs to implement neither prediction logic nor specific burst logic.

RMW Cycles

The two bus operations BL/BUL (bus lock/unlock) and CL/CUL (cell lock/unlock) support global and local ownership mechanisms: BL gives ownership of the complete bus B to a specific S, while CL does so only for a given cell within the T.

RMW cycles are represented by the following sequence:

  • BL or CL
  • SDR
  • SDW
  • BUL or CUL

The CL/CUL are high-speed operations, as they do not stall other transactions on the bus B.
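The sequence above can be modelled in a few lines of Python; the Cell class, the op method, and the RETRY convention are illustrative assumptions only:

```python
class Cell:
    """A lockable memory cell, modelling the CL/CUL operations."""
    def __init__(self, value=0):
        self.value = value
        self.owner = None    # source id currently holding the cell lock

    def op(self, source, operator, data=None):
        if self.owner not in (None, source):
            return "RETRY"            # cell locked by another source
        if operator == "CL":
            self.owner = source       # take ownership of this cell only
        elif operator == "CUL":
            self.owner = None         # release ownership
        elif operator == "SDR":
            return self.value
        elif operator == "SDW":
            self.value = data
        return "OK"

def rmw_increment(cell, source):
    """CL, SDR, SDW, CUL: an atomic read-modify-write increment.
    Only this one cell is held; the rest of the bus keeps running."""
    cell.op(source, "CL")
    v = cell.op(source, "SDR")
    cell.op(source, "SDW", v + 1)
    cell.op(source, "CUL")
```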

Wishbone II

In this section we map the Transactional Bus concept onto the Wishbone B.3 platform. However, some clarification has to be given first. The Wishbone B.3 specifications declare master and slave devices, where the master is the bus initiator and the slave the responder. In the new Wishbone II implementation we keep this master and slave terminology for backward compatibility; however, there are some key differences:

  • Master - Slave Interface: data transfer is specified between the (S,B) pair and the (B,T) pair separately, not directly. This means that B is self-sufficient for operation and that S (the master) triggers B by placing new vectors onto it; with respect to S, the bus acts as a slave. On the target side, bus B acts as a master and delivers vectors to the slave T.
  • Responses: in the theory above, responses are represented as separate vectors. Wishbone communication, however, is bidirectional and carries both source and response vectors, where the latter is slightly reduced. From this perspective, the target T may respond (in the case of an SDR) as a slave to B, which delivers a new response vector back to the source S or S’, as specified in the original source vector V. The input/output nature of the interface saves some signals, such as the operator and target identifier, which are typically not needed on the feedback path. However, one may also implement module interfaces as output only (S) or input only (T).
  • Hand-shaking: bus B uses the master and slave interfaces for hand-shaking purposes only. Bus B is responsible for priority and scheduling, may scale frequencies and hence throughput, and so forth.

Mapping to the Wishbone B.3 is achieved by introducing the following new signals:

  • WB_ACW Write Acknowledge
  • WB_ACR Read Acknowledge
  • WB_TGA Address Tag in both directions
  • WB_ALK Address (Cell) Lock

In the text that follows, the prefix WB may be changed to WBM to denote a master interface, or to WBS to denote a slave interface, or left as-is to describe either. Input signals are suffixed with _I and output signals with _O. The proposed bus discards the Wishbone B.3 ACK signal, since its functionality is now split between the ACR and ACW signals. Complete basic signal descriptions for master and slave are listed in the following table.

DESCRIPTION                   MASTER      SLAVE
Data from master to slave     WBM_DAT_O   WBS_DAT_I
Data from slave to master     WBM_DAT_I   WBS_DAT_O
Slave Address                 WBM_ADR_O   WBS_ADR_I
Transaction Strobe            WBM_STB_O   WBS_STB_I
Destination Operation         WBM_WE_O    WBS_WE_I
Bus Lock                      WBM_LOCK_O  WBS_LOCK_I
Write Acknowledge             WBM_ACW_I   WBS_ACW_O
Read Acknowledge              WBM_ACR_I   WBS_ACR_O
Address Tag Write             WBM_TGA_O   WBS_TGA_I
Address Tag Read              WBM_TGA_I   WBS_TGA_O
Address (Cell) Lock           WBM_ALK_O   WBS_ALK_I

The following sub-sections describe the four basic operations and the error notification target. The acknowledge and hand-shaking mechanism is in most cases identical to the Wishbone B.3 specifications and shall not be repeated herein. For complete timing specifications please refer to the Wishbone B.3 Specifications.

The major exception in the timing and signal scheme is the hand-shaking of the SDW and SDR operations, for which Wishbone II uses two separate acknowledge signals that may acknowledge an SDW and an SDR independently, at arbitrary times. The transaction strobe WBM_STB_O is valid in combination with the WBM_ACW_I signal only.

Single Data Write (SDW)

A write transaction is similar to the write transaction given in the Wishbone B.3 specifications. The only distinction is that Wishbone II uses the ACW signal to acknowledge a write cycle. To issue an SDW the following signals must be set:

  • WBM_DAT_O Data to be written D.
  • WBM_TGA_O Source identifier S. May be omitted unless used for scheduling purposes by bus B.
  • WBM_ADR_O Target T and address within the T.
  • WBM_WE_O Target write operation O is set high.

An example of three SDW cycles can be seen in the following figure, where the first and last are zero-wait-state cycles and the second has one wait state.

../_images/sdw.svg

Example of SDW cycles.

Single Data Read (SDR)

A read transaction is composed of two transactions: an SDR vector is issued by the source S and a response vector is returned by the target T. To issue an SDR vector the following signals must be set by the source S:

  • WBM_TGA_O Source identifier S.
  • WBM_ADR_O Target T and address within the T.
  • WBM_WE_O Target write operation O is cleared low.

After completion the target T generates a response vector, which is received on the master input ports as follows:

  • WBM_DAT_I Read data D.
  • WBM_TGA_I Source identifier S.

An example of two SDR cycles can be seen in the following figure. The figure clearly shows the hand-shaking process, first for the request issued by the source (master), then for the reception of the response vector.

The source identifier on return (WBM_TGA_I) is a copy of what was sent to the target (WBM_TGA_O). In this way the source can tag data for quick internal distribution.
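This tagging lets a master match responses to outstanding requests even when their latencies differ. A hypothetical sketch of such a dispatcher follows; the signal names are taken from the table above, but the tracker itself is not specified by the bus:

```python
class ReadTracker:
    """Routes returned read data by its echoed address tag (TGA).
    The dispatch scheme is only an illustration of how a source
    might use the tag for quick internal distribution."""
    def __init__(self):
        self.pending = {}          # tag -> completion callback

    def issue(self, tag, on_data):
        # Drive WBM_TGA_O with `tag` alongside the SDR request.
        self.pending[tag] = on_data

    def on_response(self, tga_i, dat_i):
        # On WBM_ACR_I: WBM_TGA_I echoes the tag we sent, so the
        # returned WBM_DAT_I word can be dispatched immediately,
        # regardless of the order in which responses arrive.
        self.pending.pop(tga_i)(dat_i)
```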

../_images/sdr.svg

Example of SDR cycles.

Bus (Un)Lock (BuL)

This is a bus operator. A unit may take ownership of the complete system by issuing the bus lock signal:

  • WBM_LOCK_O Set high.

It releases the ownership by clearing it to low.

Cell (Un)Lock (CuL)

This is a bus operator. The transactional bus concept introduces an advanced per-cell (per-address) locking feature to avoid stalling the rest of the transactions while supporting RMW cycles at full speed. A unit may take ownership of an individual cell or address by issuing the address lock signal:

  • WBM_ALK_O Set high.

It releases the ownership by clearing it to low. The WBM_ALK_O/I has the same timing as WBM_LOCK_O/I.

Error Notification Target (ENT)

The ENT collects all invalid vectors with the aim of detecting system errors by gathering them into one or more meaningful units. In this way there is no additional overhead between the peripherals that have detected an error (e.g. a DMA) and the CPUs that finally handle the errors and react properly.

Error vectors may be issued by the bus B or by a target T. When formed as a response vector, an error vector is defined as follows:

  • WBM_DAT_I Source address S, MSB aligned.
  • WBM_TGA_I Error target identifier ENT.

When formed as a source vector from a master interface (or by the bus B), it contains slightly more information:

  • WBM_DAT_O Source address S, MSB aligned.
  • WBM_TGA_O Error target identifier ENT.
  • WBM_ADR_O Target T and address within the T.
  • WBM_WE_O Target write operation O is set high.

Another way of handling errors is to return dummy results. In any case, for every SDR vector placed onto bus B, data must be returned, either to the ENT unit or as a dummy result to the source S, and never ignored, in order to prevent dead-locks. A CPU issuing faulty vectors will likely experience a timeout on an SDR; in that case it may identify the exact source of the error by looking into the ENT for the given source. When the system does not provide an ENT unit, error vectors must always be returned as dummy results. This may be handled by each module separately or by the bus B.
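The rule above, never ignore a failed SDR, can be sketched as a bus-side policy; the tuple layout and the exact choice between ENT and dummy response are illustrative assumptions:

```python
def handle_failed_sdr(vector, ent_id=None):
    """Never drop a failed SDR: either redirect it to the ENT or
    return a dummy result to the source, so the source cannot
    dead-lock waiting forever. Tuple layout (source, target,
    operator, payload) is illustrative only."""
    s, t, op, _addr = vector
    assert op == "SDR"
    if ent_id is not None:
        # Error source vector: the data field carries the faulting
        # source S and the target is the ENT (a write, WE set high).
        return (s, ent_id, "SDW", s)
    # No ENT in the system: answer the source with a dummy result.
    return (t, s, "RESP", 0)
```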

The length of WBM_DAT_I/O is typically larger than that of WBM_TGA_I/O, so that the complete source information can be included.

Evaluation

The figure below shows an example system with one pipeline stage on the write (input) and read (output) paths between the source (master) and destination (slave) devices. The system has a one-cycle pipeline in both directions; therefore, a request-response loop takes at least 2 wait cycles. The slave (memory) may also perform some internal management, such as refresh, which adds to the total number of wait states.

../_images/read-example.svg

System latency due to pipe-lining.

../_images/read.svg

Wishbone II Read Cycles.

The figure above depicts a transaction bus data flow diagram for the given example of three read request transactions placed by the master as AD0, AD1, and AD2, and the associated returned read response transactions as DO0, DO1, and DO2. The signal WE is assumed to be cleared for all three transactions to indicate read operations. Transactions AD0 and AD1 are burst transactions, meaning that AD1 = AD0 + 1, while AD2 is an independent transaction triggered in the meantime; it could, for example, be caused by an external interrupt loading its interrupt vector.

Each read request transaction is acknowledged by the ACW signal, and the returned read response transaction is marked (acknowledged) by the ACR signal. Note that the latencies need not be equal, due to other higher-priority master(s), memory refresh functions, and so on. In the example above, AD0 is immediately acknowledged but it takes 3 wait cycles to return DO0; AD1 is acknowledged one cycle later while DO1 is returned in only 2 wait cycles, and DO2 again takes 3 wait cycles. All three transactions complete in 7 cycles; without the two illustrative wait cycles they would theoretically complete in only 5 cycles. The same scenario using the Wishbone B.3 specifications is shown in the figure below.

../_images/read-classic.svg

Wishbone B.3 Classic Read Cycles.

Here again AD0 and AD1 are bursts, AD1 = AD0 + 1, and AD2 is an independent request. All three transactions complete in 11 cycles, compared to 7 cycles in Wishbone II with the same number of wait states.

A continuous burst on Wishbone II would run with zero stall cycles, with absolutely no bus stalls even when more than one master coexists in the system. To illustrate: for a system running at 150 MHz, long bursts with a fixed latency of 2 cycles would yield a Wishbone II bandwidth of 150 Mwords, while Wishbone B.3 would reach only 50 Mwords.
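The quoted figures follow directly from the cycle counts, assuming one word per clock once the Wishbone II pipeline is filled and a full 1 + latency cycle loop per Wishbone B.3 classic read:

```python
F_CLK = 150_000_000   # system clock, Hz
LATENCY = 2           # fixed read latency, cycles

# Wishbone II: requests are pipelined, so after the initial latency
# one word completes on every clock cycle.
wb2_bandwidth = F_CLK                    # words per second

# Wishbone B.3 classic: each read holds the bus for the whole
# request-response loop of 1 + LATENCY cycles before the next starts.
wb3_bandwidth = F_CLK // (1 + LATENCY)   # words per second
```

This gives 150 Mwords/s versus 50 Mwords/s, matching the figures above.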

Conclusion

The transparent architecture presents itself as a simple input-output black box; the implementation, however, is based on a multi-pipeline structure where each (FIFO) line holds one transaction vector.

The Wishbone II bus proposes an advanced transaction-bus-oriented architecture for SoC designs in FPGAs and ASICs, in which write and read operations are handled as separate write and read transactions. Each transaction is stored in a single line, and the multi-pipeline architecture acts as a FIFO buffer transporting multiple transactions from and to multiple source and destination modules. An advanced locking mechanism, using temporary per-cell locks, prevents the complete bus from stalling due to RMW cycles. In this way overall data throughput is increased up to the maximum, while the design successfully integrates slow and high-speed, low- and high-latency peripherals and CPUs.

Wishbone II is at the same time backward compatible and can interface all Wishbone B.3 slave cores.

References