Thursday, August 21, 2014

Tight Rope Walker - A Verification Scenario

My four-year-old daughter loves drawing. I got her a big join-the-dots drawing book, a fairly complicated one with around a hundred dots per picture. On Monday morning she asked which one she should do. I glanced through the pages and asked her to complete a page titled Tight-Rope Walker. In the evening, she called to tell me she was done. Good! I asked her to colour it. Busy with work and customers, I could not review her work. The next evening, I saw the completed picture, with her own creative additions too. It was nice, but it was not a tight-rope walker. There was no rope in the drawing. The closest description would be a flying man.

She did not know what a tight-rope walker was, and I had not felt the need to tell her before she started drawing. I realized that, at work, my condition is no different from hers. Many verification engineers would empathize. Many a time, at the end of a project, they have the is-that-so moment. The unfortunate ones get it in the lab.
When we look at verification process failures from the top, we can categorize them into two (or maybe one*). The first arises from the inability of the individual to understand the larger goal. The second is the inability of management to predict verification holes, i.e. the holes left due to verification errors. We face these situations in tightly scheduled projects and relaxed-schedule projects alike (the latter being hypothetical!).

Failure to understand larger goals

Separated by distance from the ayatollahs of the program, the necessary knowledge does not seem to percolate down to the grass roots. Under tight project schedules, architects do not find time to talk to all the engineers and convey the thought process behind their implementation schemes. Communication between marketing and engineers is unheard of. Channels are created with the genuine intent of knowledge sharing, but every channel has its own channel loss. With hierarchy coming into the picture, the loss is multiplied (added, if you are a dB fan). The thought process is put on the back burner and only procedures are conveyed: join the dots.

The very basis of the verification industry's existence is 'to err is human'. It is expected that design engineers will commit some errors and that verification engineers will find them. Who will cover the mistakes of the verification engineers? There are tools. EDA companies are trying their best to come up with new tools and tactics to cross-check the verification engineers' job. With all of this, there are still bugs on silicon. There is a need to cover the verification engineers better, a need to expect that the verification engineer too is going to make mistakes, and a need to channel those mistakes into the less important parts. It would not have mattered if my daughter had missed drawing a window of a house in the background. But the rope?!

Classical View

The issues encountered during and after verification can be grossly divided as:
- Architectural issues
- Feature-related issues
  - Unimplemented feature
  - Mis-interpreted feature
  - Mis-implemented feature

Architectural issues: These are issues related to the mainline functionality. When the functionality, realized as a chain of operations spread across multiple modules, gets a weak link, it results in this kind of issue. These are the most undesirable of the lot to have on silicon.

Unimplemented features: Most features are implemented and verified correctly, but there could be one that misses implementation for various reasons. Issues that find no mention in either the design or the verification documents belong to this category.
Mis-interpreted features: These bugs are the result of incorrect, or multiple, interpretations of the same specification; precisely the same misinterpretation by design and verification.

Mis-implemented features: Features for which there is no understanding or interpretation issue, but which went wrong during the course of implementation.

Architectural issues can be caught with proof-of-concept level verification. We run some end-to-end test scenarios to make sure the mainline functionality is intact. A religious approach to mimicking real-world functionality is the safest way to uncover these issues. Shortcuts taken in the proof of concept may prove a breeding ground for the much-hated bugs.

For trapping mis-implemented feature bugs, the EDA industry has provided a plethora of tools and methodologies, including code coverage (block, condition, toggle, FSM), functional coverage (simple, cross), assertions, formal verification, etc. The amount of automation leaves hardly any gap for these kinds of bugs. Make some fire with these tools and these insects will fly into it themselves. It has been made that simple.

Mis-interpreted feature bugs can only be caught if the interpretations of the design and verification engineers differ. The tests aimed at the feature bring the issue to the fore, to be discussed with a wider and wiser audience, and the bug gets caught. Any agreement between design and verification on an interpretation (read: misinterpretation) makes a safe haven for some bugs. The situation is grimmer when the features are explained by the design team and not by the architects. Not all of these bugs successfully make it to silicon, but they do manage to eat precious time.

A video processing team started working on a new project. A module was to operate on the red component. The designer happily added Cr (chroma red) to its input list and implemented the functionality, which happened to be an algorithm developed by a software team sitting at the far end of the world. The verification engineers, without a clue about what they were verifying, did a perfect job of making sure that the design matched what the designer had written in the document. But the module was supposed to operate on the red (R) component of RGB, not on Cr of YCbCr!
In another project, a few days before tape-out, the architect asked for a test case that would use different service categories for the same types of customers, probably with different SLAs. What a bouncer! Nobody understood what it meant. Can we expect a bug-free outcome?

These, and many other possible examples, show how a bug is spared or time is lost because of a communication gap.

Sunday, July 6, 2014

Understanding Cyclic Redundancy Check Implementation


Abstract

The goal of this write-up is to understand the functionality and implementation of Cyclic Redundancy Check (CRC) schemes. The first part deals with division in Galois fields (modulo-2 division), which forms the basis of CRC, and its implementation in hardware using a 1-bit shift. The later part covers parallelization of the CRC hardware, treating n bits at a time, along with a table-based implementation.

CRC Concept

CRC is a scheme for error detection. In this scheme, extra information r is provided along with the original message m. The extra information is related to the original message m by some function f. Parity is one such function, wherein all the bits of the message are XORed together. For illustration and easy understanding, let us take an example from the arithmetic domain. Suppose we need to send 2 digits of information, where the information can have a value in the range 10 to 90. To form the redundant information we select a function of the type modulo g: we divide the message m by g and treat the remainder r as the extra information. The complete transmission then becomes {m, r}. For example, if the message is 74 and we select g as 9, the transmitted stream would be {74, 2}, as 74 mod 9 = 2 (f being mod).
This message is transmitted in a noisy environment where a digit may get changed. The receiver of this message receives {m', r'}. It then repeats the same procedure as the transmitter, dividing m' by g to find the remainder, and compares it with the received r'. If they match, the message is accepted; otherwise the received message is declared to be in error. The error detection capability depends on the selection of g. If g is selected as 2, the same stream would look like {74, 0}. If 74 gets changed to 76 by a one-digit error, the receiver cannot detect it, as 76 mod 2 = 0, which is the same as 74 mod 2.
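To make the arithmetic example concrete, here is a tiny C sketch of the sender/receiver check described above; the function name check and the driver values are purely illustrative, not part of any standard.

#include <stdio.h>

/* Toy model of the {m, r} scheme: the sender appends r = m mod g and the
   receiver recomputes the remainder of whatever message it received. */
int check(int m_received, int r_received, int g)
{
    return (m_received % g) == r_received;    /* 1 = accepted, 0 = error detected */
}

int main(void)
{
    int m = 74;
    printf("%d\n", check(76, m % 9, 9));      /* 0: the one-digit error is caught with g = 9 */
    printf("%d\n", check(76, m % 2, 2));      /* 1: the same error slips past g = 2          */
    return 0;
}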
The same concept, applied digitally with binary logic, forms the basis of CRC. Let m be the message to be transmitted, consisting of k bits. We divide this message by another binary stream g, consisting of (n+1) bits, to find a remainder of n bits. The rules for division are almost the same, except that there is no carry or borrow here. In summary:
1 ± 1 = 0, 1 ± 0 = 1, 0 ± 1 = 1, 0 ± 0 = 0
This division is called modulo-2 division. The remainder r always consists of n bits if the divisor consists of (n+1) bits. With remainder r, the transmitted stream becomes {m, r}.

The division Procedure

Let the message be m = 10010101 and g = 1011. The division is performed as shown in Figure 1.
Figure 1 Modulo-2 division
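As a cross-check of the worked example, the following small C sketch performs the same paper-style modulo-2 long division; mod2_div and its argument names are our own, chosen only for this illustration.

#include <stdio.h>

/* Modulo-2 long division as in Figure 1: wherever the leading bit of the
   working value is 1, the divisor is XORed in at that position. */
unsigned mod2_div(unsigned m, int k, unsigned g, int n)
{
    for (int i = k - 1; i >= n; i--)      /* walk the k-bit dividend, MSB first  */
        if (m & (1u << i))                /* leading 1 -> quotient bit is 1      */
            m ^= g << (i - n);            /* subtract (XOR) g aligned at bit i   */
    return m;                             /* what remains is the n-bit remainder */
}

int main(void)
{
    /* m = 10010101, g = 1011 (degree n = 3) */
    printf("remainder = %x\n", mod2_div(0x95, 8, 0xB, 3));   /* prints 2, i.e. 010 */
    return 0;
}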

Hardware Implementation of division

Having understood the division, let us proceed towards the hardware implementation. Stage 0 does not need any explanation. At stage 1, we have 010 as the remainder from stage 0 and a 0 brought down as part of the division procedure, so we have 0100 to deal with. The first task is to find the next bit of the quotient. This is very simple: the next quotient bit is always bit 2 of the previous stage's result (d2), which, once the next dividend bit is brought down, sits as the MSB of the vector we are dealing with. The subtraction term is calculated by ANDing the divisor with the next quotient bit; in arithmetic division it would be obtained by multiplying the divisor by the new quotient digit. Here, if qi is the next quotient bit, then sub_m[j] = g[j] & qi for every bit j, or simply, using the ternary operator, sub_m = (qi == 1) ? g : 0. In hardware this is implemented with AND gates or a multiplexer as
sub_m[0] = g[0] & qi
sub_m[1] = g[1] & qi
...
sub_m[n] = g[n] & qi
This way we have found sub_1, which is 0000 in this case. The next step is to find stage2_r2, obtained by bitwise XORing the stage1_r1 bits with the sub_1 bits. The leftmost bit of the result is always going to be 0; it has done its job and is thrown out (shifted out) while assembling the data for the next stage, making one bit position available to pull one more bit down from the dividend.
d3 = stage1_r1[3] ^ sub1[3]
d2 = stage1_r1[2] ^ sub1[2]
d1 = stage1_r1[1] ^ sub1[1]
d0 = stage1_r1[0] ^ sub1[0]

stage2_r2 = d << 1, i.e. stage2_r2[3:1] = d[2:0]
stage2_r2[0] = next bit from the dividend
The above sequence of operations needs to be repeated for the remaining bits of the dividend. It can also be observed that d3 is not used anywhere, so there is no need to implement it in hardware.
With the above deductions, the hardware for modulo-2 division with a 4-bit divisor g looks as shown in Figure 2.

Figure 2 Modulo-2 division using 1 bit shift
The hardware can be reduced for a known value of g: each AND gate can be replaced by a simple wire or an open circuit depending on whether g[i] is 1 or 0 respectively.
The same logic can be extended to a divisor g of any width.
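To connect the equations to the figure, here is a small C model of the serial divider: the n-bit remainder register is updated once per clock, one message bit per call, with the AND of each divisor bit with the quotient bit feeding the XORs. The function names serial_div_step and serial_div are ours, used only for this sketch.

#include <stdio.h>

/* One clock of the serial divider of Figure 2. */
unsigned serial_div_step(unsigned reg_r, int msg_bit, unsigned g, int n)
{
    unsigned mask = (1u << n) - 1;            /* the n register bits            */
    unsigned qi = (reg_r >> (n - 1)) & 1;     /* quotient bit = register MSB    */
    unsigned shifted = ((reg_r << 1) | (unsigned)(msg_bit & 1)) & mask;
    return qi ? (shifted ^ (g & mask)) : shifted;   /* AND gates feeding XORs   */
}

/* Feed a k-bit message MSB first, one bit per "cycle". */
unsigned serial_div(unsigned m, int k, unsigned g, int n)
{
    unsigned r = 0;                           /* register cleared before the stream */
    for (int i = k - 1; i >= 0; i--)
        r = serial_div_step(r, (m >> i) & 1, g, n);
    return r;
}

int main(void)
{
    printf("remainder = %x\n", serial_div(0x95, 8, 0xB, 3));  /* again 2 (010) */
    return 0;
}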

Parallelizing division

The division discussed above takes in one bit every clock cycle. This may not be sufficient when handling data streams at high data rates, where data arrives in bigger chunks, e.g. 8 bits/clock or 32 bits/clock. In these cases, we should be able to handle multiple input bits per division step.
Figure 3 helps in understanding the division treating a nibble at a time. The difference from the previous 1-bit approach is that we feed 4 bits into the next stage, so we need to find 4 quotient bits in one clock cycle. The approach is to perform stage 0 to stage 3 combinationally.

Figure 3 Nibble-wide modulo-2 division
Consider the stage marked ds_0. At this stage we have available an 8-bit vector ds_0[7:0] and the 4-bit divisor g[3:0]. In ds_0, bit 7 is always 0, the next 3 bits ds_0[6:4] are the remainder from the previous stage, and the remaining 4 bits ds_0[3:0] are the new bits from the message stream. The problem is to find q[3:0] and the remainder r from ds_0 and g; finally, we need to express r in terms of ds_0 and g.
q3 = ds_0[6]        (as discussed for the 1-bit division)
Finding q3 is sufficient to define the ds_1 vector.
ds_1[5] = ds_0[5] ^ g[2] & q3
ds_1[4] = ds_0[4] ^ g[1] & q3
ds_1[3] = ds_0[3] ^ g[0] & q3
Since ds_1[6] is not used in further calculations, we do not waste energy finding it. ds_1[2:0] = ds_0[2:0]. A similar process is followed for the next stage.
q2 = ds_1[5]        (this gives us the way to find ds_2)
ds_2[4] = ds_1[4] ^ g[2] & q2
ds_2[3] = ds_1[3] ^ g[1] & q2
ds_2[2] = ds_1[2] ^ g[0] & q2
ds_2[1:0] = ds_1[1:0]
Along the same lines, q1 = ds_2[4]
ds_3[3] = ds_2[3] ^ g[2] & q1
ds_3[2] = ds_2[2] ^ g[1] & q1
ds_3[1] = ds_2[1] ^ g[0] & q1
ds_3[0] = ds_2[0]

Finally q0 = ds_3[3]
r[2] = ds_3[2] ^ g[2] & q0
r[1] = ds_3[1] ^ g[1] & q0
r[0] = ds_3[0] ^ g[0] & q0

The above combinational logic (div_4 combi) can be used to calculate the remainder of a long message stream, as shown in Figure 4.

Figure 4 Division of long stream
The same method can be extended to perform division treating 8, 16, 32, or in general n bits at a time. As the value of n increases, the depth of the combinational logic grows, thereby reducing the maximum operating frequency of the circuit.
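A compact C model of the div_4 combi block may help tie the equations together. The four loop iterations correspond to the stages ds_0 to ds_3 above; instead of letting the active 4-bit window slide down, the vector is shifted up so the window stays at bits 6:3. The names div4_combi, r_prev and nibble are ours.

#include <stdio.h>

/* Nibble-wide combinational divider (div_4 combi of Figure 4). */
unsigned div4_combi(unsigned r_prev, unsigned nibble, unsigned g)
{
    unsigned ds = ((r_prev & 0x7) << 4) | (nibble & 0xF); /* ds[6:4] = r, ds[3:0] = new bits */
    for (int stage = 0; stage < 4; stage++) {
        unsigned q = (ds >> 6) & 1;           /* q3, q2, q1, q0 in turn        */
        if (q)
            ds ^= (g & 0xF) << 3;             /* XOR g under the active window */
        ds <<= 1;                             /* move on to the next stage     */
    }
    return (ds >> 4) & 0x7;                   /* the 3-bit remainder           */
}

int main(void)
{
    unsigned r = 0;                           /* m = 10010101, two nibbles     */
    r = div4_combi(r, 0x9, 0xB);              /* 1001                          */
    r = div4_combi(r, 0x5, 0xB);              /* 0101                          */
    printf("remainder = %x\n", r);            /* prints 2 (010), as before     */
    return 0;
}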

Cyclic Redundancy Check

The CRC checksum is equal to the remainder obtained by dividing the message stream m by the generator polynomial g. The message stream and the generator are expressed as polynomials; g = 1001 is expressed as x^3 + 1. There are many variants of expressing g, based on the assumption that the highest and lowest degree terms always have a coefficient of 1 [1].
We have seen the division of a long message stream by g. The CRC implementation uses the same division technique, with the difference that the input message stream m is padded with as many zeros as the degree of the generator polynomial. This makes the check at the receiver simple: when a message with its checksum is received, the receiver calculates the remainder over the complete stream (including the checksum), and for a correctly received message the remainder has to be 0.
On top of the above procedure, communication standards may define additional steps in the checksum calculation for the corresponding communication. The standards also define the generator polynomial. IEEE Std 802.3 [2] defines the generator polynomial for Ethernet as
G(x) = x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
Apart from padding with 32 zeros, the standard requires complementing the first 32 bits of the frame and complementing the remainder before sending it out. The former requirement is achieved by preloading the CRC calculation engine with all 1s.
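The sketch below shows how these two extra steps can sit around a bit-serial divider. It uses the 'direct' form of the shift register, in which the multiplication by x^32 (the 32 padding zeros) is absorbed into the register update, so preloading with all 1s has the effect of complementing the first 32 message bits. Real Ethernet also transmits bits LSB first within each byte, which this MSB-first sketch deliberately ignores, so it will not reproduce actual FCS bytes; it only illustrates the preload and final complement. The name crc32_sketch is ours.

#include <stdint.h>

/* Bit-serial CRC-32 sketch: preload with ones, complement the remainder. */
uint32_t crc32_sketch(const uint8_t *msg, int nbytes)
{
    const uint32_t poly = 0x04C11DB7u;        /* G(x) above, x^32 term implicit */
    uint32_t r = 0xFFFFFFFFu;                 /* preload the engine with all 1s */
    for (int i = 0; i < 8 * nbytes; i++) {
        int bit = (msg[i / 8] >> (7 - (i % 8))) & 1;   /* message bit, MSB first */
        uint32_t q = (r >> 31) ^ (uint32_t)bit;        /* quotient bit           */
        r <<= 1;
        if (q)
            r ^= poly;                        /* modulo-2 subtraction of G      */
    }
    return ~r;                                /* complement the remainder       */
}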

Table-based implementation

The approach is similar to the arithmetic division we do using tables. Instead of having tables for all numbers, we only need a table for G, as it is fixed for a given CRC implementation. The depth of the table depends on how many bits we want to handle in one cycle. For the previous nibble-wide division, we need to handle 7 bits in one cycle (4 new bits from the stream + 3 bits of remainder from the previous cycle). This corresponds to 2k+1 bits, where k is the degree of G; hence we need a table of depth 2^7 = 128. Each location, indexed by the 7-bit number ds, contains the 3-bit remainder obtained by dividing ds by G. If ds = G·q + r (· being modulo-2 multiplication and + modulo-2 addition), the data structure we would want to maintain is {q, r} for all possible values of ds. We can omit q, as it plays no role apart from finding the remainder in the corresponding stage. The pseudo-code for this implementation would look as follows:
ds_r[];          // table: remainder of each index ds when divided by G
const int k;     // degree of the generator polynomial G
r = 0;           // running remainder
while (m != EMPTY) {
    ds[2k:k+1] = r;          // previous remainder in the upper k bits
    ds[k:0]    = m[0:k];     // next (k+1) bits of the message in the lower bits
    m = m << (k+1);          // consume those message bits
    r = ds_r[ds];            // table lookup gives the new remainder
}
The table-based approach is comparatively simpler to implement. If the table ds_r is implemented in memory, it can be reused for different generator polynomials by reprogramming it. On the negative side, the table size grows exponentially with the size of the generator polynomial. This requirement can be brought down by handling fewer bits per cycle, independent of the degree of the generator polynomial, at the cost of performance.
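As a software sketch of this scheme, the following C program builds the 128-entry ds_r table for the nibble-wide example (g = 1011, k = 3) and then drives it nibble by nibble; build_table, mod2_rem and the other names are ours, chosen only for illustration.

#include <stdio.h>

static unsigned char ds_r[128];               /* depth 2^7, 3-bit remainder per entry */

/* Plain modulo-2 long division, used only to fill the table. */
static unsigned mod2_rem(unsigned v, int width, unsigned g, int n)
{
    for (int i = width - 1; i >= n; i--)
        if (v & (1u << i))
            v ^= g << (i - n);
    return v;                                 /* n-bit remainder */
}

static void build_table(unsigned g)
{
    for (unsigned ds = 0; ds < 128; ds++)     /* every 7-bit index {r, nibble} */
        ds_r[ds] = (unsigned char)mod2_rem(ds, 7, g, 3);
}

int main(void)
{
    build_table(0xB);                         /* g = 1011 */
    unsigned nibbles[] = { 0x9, 0x5 };        /* m = 10010101, nibble by nibble */
    unsigned r = 0;
    for (int i = 0; i < 2; i++)
        r = ds_r[(r << 4) | nibbles[i]];      /* index = {previous r, next 4 bits} */
    printf("remainder = %x\n", r);            /* prints 2 (010), as before */
    return 0;
}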

Summary

The cyclic redundancy check is the remainder of a division by the generator polynomial. Of the two implementations discussed here, the table-based one seems more suitable for implementation in software, while the combinational hardware implementation is better suited to high-bit-rate requirements.

References

  1. Wikipedia, Cyclic Redundancy Check, http://en.wikipedia.org/wiki/Cyclic_redundancy_check
  2. IEEE Std 802.3-2008, Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications