Sunday, July 6, 2014

Understanding Cyclic Redundancy Check Implementation

  

Understanding Cyclic Redundancy Check Implementation

Abstract

The goal of write up is to understand the functionality and implementation of Cyclic Redundancy Check (CRC) schemes. The first part deals with division in Gallios Fields (modulo-2 division) which forms the basis of CRC and its implementation in hardware using 1-bit shift. The later part covers parallelization of CRC hardware treating n-bits at a time along with table based implementation.

CRC Concept

CRC is a scheme for error detection. In this scheme extra information, r is provided along with the original message, m. The extra information is related to original message m, by some function f. Parity is one such function wherein all the bits of message are ex-ored. For illustration and easy understanding, let’s take an example from arithmetic domain. Suppose we need to send 2 digits of information, where the information can have value in range 10 to 90. To form the redundant information we select a function of type module g. We divide message m by g and treat the remainder r as extra information. The complete transmission scheme then becomes {m,r}.  Ex.  If the message is 74 and we select g as 9 then the transmission stream would be {74, 2} as 74 mod 9 = 2 (f being mod). This message is transmitted in noisy environment with possibility of a digit getting changed. The receiver of this message would receive {m’,r’}. It then repeats the same procedure as transmitter of finding remainder r, dividing by g. If r’ matches r then m’ matches m, otherwise the received message is pronounced errored. The error detection capability depends on selection of g. If g is selected as 2, then the same stream would look like {74, 0}. If 74 gets converted to 76 with one digit change then receiver cannot detect the error as 76 mod 2 = 0 which is same as 74 mod 2.
The same concept when applied in digital with binary logic forms the basis for our CRC. Let m be the message to be transmitted consisting of k bits. We divide this message with another binary stream g, consisting of (n+1) bits to find out remainder of n bits. The rules for division are almost the same except that there is no carry here. The summary for rules
1 ± 1 = 0, 1 ± 0 = 1, 0 ± 1 = 1, 0 ± 0 = 0
This division is called modulo-2 division. The remainder r is always consists of n bits if divisor consists of n+1 bits. With remainder r, the transmit stream becomes {m, r}.

The division Procedure

Let the message m = 10010101, g= 1011. The division is performed as shown in Figure 1
Figure 1 Modulo-2 division

Hardware Implementation of division

Having understood the division lets proceed towards hardware implementation. The stage 0 does not need any explanation. At stage 1, we have 010 as remainder from stage 0 and 0 as brought down as a part of division procedure. We have 0100 to deal with. The first task is to find out next bit in quotient. It is very simple as next bit in quotient is always d2. The subtraction term is calculated by ANDing divisor with next quotient bit. In arithmetic division it obtained by multiplying divisor with new quotient term. Here, if qi is the next bit in quotient term, qi = d2 and Sub_m = { j : gj & qi } Or simply with ternary operator Sub_m = qi == 1 ? g : 0. In hardware it is going to be implemented with AND gates or multiplexer as
sub_m[0] = g[0] & qi
 sub_m[1] = g[1] & qi
sub_m[n] = g[n] & qi
This way we have found out sub_1 which is 0000 in this case. The next step is to find out stage2-r2. This is obtained by bitwise XORing stage1-r1 bits with sub_1 bits. The leftmost bit is always going to be 0 in the result. This bit has done its job and is thrown out (shifted out) while considering data for next stage, making one bit position available to pull one bit down from dividend.
d3 = stage1_r1 [3] ^ sub1 [3]
d2 = stage1_r1 [2] ^ sub1 [2]
d1 = stage1_r1 [1] ^ sub1 [1]
d0 = stage1_r1 [0] ^ sub1 [0]

stage2_r2 = d<<1 i.e. stage_2_r2 [3:1] = d [2:0]
stage2_r2 [0] = next term from dividend
The above sequence of operation needs to be repeated for remaining bits in dividend. It can also be observed that d3 is not used anywhere, so no need to implement it in hardware.
With above deductions the hardware for mod2 division for 4 bit divisor g looks as shown in Figure 2

Figure 2 Modulo-2 division using 1 bit shift
The hardware can be reduced for known value of g. The AND gates can converted to simple wire or open circuit based on value g[i] is 1 or 0 respectively.
The same logic can be extended for any n bit value of g.

Parallelizing division

The division discussed above inputs one bit every clock cycle. This may not be sufficient while handling data streams at high data rate where data arrives as in bigger chunks e.g.  8 bits/clock, 32 bits /clock.  In these cases, we should be able to handle multiple bits as input to division.
The Figure 3 helps in understanding the division treating a nibble at a time. The difference with the previous 1 bit approach is that we feed 4 bits into next stage. We need to find out 4 bit quotient in one clock cycle. The approach is to perform stage 0 to stage 3 in combinatorial way.

Figure 3 Nibble wide Module 2 division
 Consider the stage marked as ds_0, at this stage we have available with us 8 bits vector ds_0[7:0] and 4 bit divisor g[3:0] . In ds_0, bit 7 is always 0, next 3 bits ds_0[6:4] are the remainder from previous stage and remaining 4 bits ds_0[3:0] are the new bits from message stream. The problem is defined as finding q[3:0] and remainder r from ds_0 and g . Finally we need to express r in terms of ds_0 and g.
q3 = ds_0[6]                        As discussed in 1 bit division
Finding q3 is sufficient to define ds_1 vector.
ds_1[5] = ds_0[5] ^ g[2] & q3
ds_1[4] = ds_0[4] ^ g[1] & q3
ds_1[3] = ds_0[3] ^ g[0] & q3
Since ds_1[6] is not used for further calculations we don’t waste energy in finding it. ds_1[2:0] = ds_0[2:0]. For next stage similar process needs to be followed.
q2 = ds_1[5]                    This gives way to find out ds_2
ds_2[4] = ds_1[4] ^ g[2] & q2
ds_2[3] = ds_1[3] ^ g[1] & q2
ds_2[2] = ds_1[2] ^ g[0] & q2
ds_2[1:0] = ds_1[1:0]
On the same line, q1 = ds_2[4]
ds_3[3] = ds_2[3] ^ g[2] & q1
ds_3[2] = ds_2[2] ^ g[1] & q1
ds_3[1] = ds_2[1] ^ g[0] & q1

Finally q0 = ds_3[3]
r[2] = ds_3[2] ^ g[2] & q0
r[1] = ds_3[1] ^ g[1] & q0
r[0] = ds_3[0] ^ g[0] & q0

The above combinatorial logic (div_4 combi) can be used to calculate remainder of long stream of message as shown in Figure 4

Figure 4 Division of long stream
Same method can be extended to perform division treating 8, 16, 32, n bits at a time. As the value of n increases, the combinatorial logic increases thereby reducing the maximum frequency of operation for the circuit.

Cyclic Redundancy check

CRC checksum is equal to remainder obtained by dividing message stream m by generator polynomial g.  The message stream and generator terms are expressed as polynomials. g = 1001 is expressed as x3+1. There are many variants of expressing g based on assumptions that highest and lowest degree terms always have coefficient as 1 [1] .
We have seen the division of long message stream by g. CRC implementation uses the same division technique with a difference that input message stream m , is padded with zeros equal to degree of generator polynomial. This makes the implementation of check at receiver simple. When any message with checksum is received, the receiver calculates remainder on complete stream (including checksum) for correctly received message the remainder has to be 0.
Apart from above procedure, the communication standards may define the different procedures for calculation of checksum for corresponding communication. The standards also define generator polynomial. IEEE in standard 802.3 [2] defines generator polynomial for Ethernet communication as
 G(x) = x32 + x26 + x23 + x22 + x16 + x12 + x11 + x10 + x8 + x7 + x5 + x4 + x2 + x + 1   
Apart from padding 32 zeros, the standard requires complementing of first 32 bits of frame and complementing of remainder before sending out. The former requirement is achieved by preloading CRC calculation engine with all 1s.

Table based implementation

The approach is similar to the arithmetic division we do using tables.  Instead of having tables for all numbers, we need to have table only for G, as it is fixed for one CRC implementation. The depth of table depends on how many bits we want to handle in one cycle. For the previous nibble wide division, we need to handle 7 bits in one cycle (4 new bits from stream + 3 bits of remainder from previous cycle). This corresponds to 2k+1 bits where k is degree of G. Hence we need a table of depth 27 = 128. Each location index by 7 bit number ds, contains 3 bits of remainder that could be obtained by dividing ds by G. If ds = G x q + r, x -> mod2 multiplication, + -> mod2 addition, there the data structure that we would probably want to maintain is {q, r} for all possible values of ds. We can omit q as it does not play any role apart from finding remainder in the corresponding stage. The pseudo-code for this implementation would look as
ds_r[] ;  // Array of remainder obtained by dividing ds_r by G
const int k; // Degree of generator polynomial G
r = 0 is the remainder
while(m!= EMPTY) {
     ds[2k:k+1] = r ; ds[k:0] = m [0:k] ; m = m << k +1 ;
     r = ds_r[ds];
}
The table based approach is comparatively simpler to implement. If the table ds_r, is implemented in memory, it can be used for different generator polynomials by reprogramming. On negative side, the hardware requirement increases exponentially with increase in size of generator polynomial. The hardware requirement can be brought down by treating less number of bits independent of degree of generator polynomial at the cost of performance.

Summary

Cyclic redundancy check is remainder of division by generator polynomial. Out of the two implementations discussed here table based implementation seems more suitable for implementation in software. The former hardware implementation is more suitable for high bit rate requirements.

References

  1. Wikipedia, Cyclic Redundancy Check, http://en.wikipedia.org/wiki/Cyclic_redundancy_check
  2. IEEE Std 802.3-2008, Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications