Thursday, August 21, 2014

Tight-Rope Walker - A Verification Scenario

My four-year-old daughter loves drawing. I got her a big join-the-dots drawing book, a fairly complicated one with around a hundred dots per picture. On Monday morning she asked which one she should do. I glanced through the pages and asked her to complete a page titled Tight-Rope Walker. In the evening, she called to tell me that she was done. Good! I asked her to colour it. Busy with work and a customer, I could not review her work. The next evening, I saw the completed picture, with her own creative additions too. It was nice, but it was not a tight-rope walker. There was no rope in the drawing. The closest description would be a flying man.

She did not know what a tight-rope walker was. I had not felt the need to tell her before she started the drawing. I realized that, at work, my situation is no different from hers. Many verification engineers would empathize. Many a time, at the end of a project, they have an is-that-so moment. The unfortunate ones get it in the lab.
When we look at verification process failures from the top, we can categorize them into two (or maybe one *). The first arises from the inability of the individual to understand the larger goal. The second is the inability of management to predict verification holes, i.e. the holes left due to verification errors. We face these situations in tightly scheduled projects as well as in relaxed ones (the latter being hypothetical!).

Failure to understand larger goals

Separated by distance from the Ayatollahs of the program, the necessary knowledge does not seem to percolate to the grass roots. In tightly scheduled projects, architects do not find time to talk to all the engineers and convey the thought process behind their implementation schemes. Communication between marketing and engineers is unheard of. Channels are created with the genuine intent of knowledge sharing, but every channel has its own channel loss. With hierarchy coming into the picture, the loss is multiplied (added, if you are a dB fan). The thought process is put on the back burner and only procedures are conveyed: join the dots.
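
For the dB fans, the "multiplied or added" remark can be worked out in a few lines. This is only an illustrative sketch: the three hops and the 50% retention per hop are made-up numbers, and the function names are hypothetical. It assumes the fraction of knowledge surviving each hop behaves like a power ratio.

```python
import math

def retained_fraction(hop_fractions):
    """Multiply per-hop retention fractions to get end-to-end retention."""
    total = 1.0
    for fraction in hop_fractions:
        total *= fraction
    return total

def loss_db(fraction):
    """Loss in decibels for a retained fraction, treated as a power ratio."""
    return -10.0 * math.log10(fraction)

# Hypothetical hierarchy: architect -> lead -> engineer -> new joiner,
# with each hop passing on half of what it received (assumed numbers).
hops = [0.5, 0.5, 0.5]

end_to_end = retained_fraction(hops)           # fractions multiply
total_db = sum(loss_db(f) for f in hops)       # decibel losses add

print(f"retained end to end: {end_to_end:.3f}")
print(f"total loss: {total_db:.2f} dB")
```

Either way of counting gives the same answer: three hops at roughly 3 dB each leave only one eighth of the original intent at the far end.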

The very basis of the verification industry's existence is 'to err is human'. It is expected that design engineers will commit some errors and that verification engineers will find them. But who will cover the mistakes of verification engineers? There are tools. EDA companies are trying their best to come up with new tools and tactics to cross-check the verification engineers' job. Even with all these, there are still bugs on silicon. There is a need to cover the verification engineers better. There is a need to expect that the verification engineer, too, is going to make mistakes. There is a need to channel these mistakes to the less important parts. It would not have mattered if my daughter had missed drawing a window of a house in the background. But the rope?!

Classical View

The issues encountered during and after verification can be broadly divided as:
· Architectural issues
· Feature-related issues
  o Unimplemented feature
  o Mis-interpreted feature
  o Mis-implemented feature

Architectural issues: These are issues related to the main-line functionality. When functionality realized by a chain of operations spread across multiple modules has a weak link, the result is this kind of issue. These are the most undesirable of the lot to have on silicon.

Unimplemented feature: Mostly, all the features are implemented and verified correctly. There could be one whose implementation is missed for various reasons. Issues that find no mention in either the design or the verification documents belong to this category.

Mis-interpreted feature: These bugs are the result of an incorrect interpretation, or of multiple interpretations, of the same specification. More precisely, they are the result of the same misinterpretation being made by both design and verification.

Mis-implemented feature: These are features for which there was no understanding or interpretation issue, but which went wrong during the course of implementation.

Architectural issues could be caught at proof-of-concept level verification. We run some end-to-end test scenarios to make sure the main-line functionality is intact. A religious approach of mimicking real-world functionality is the safest way to uncover these issues. Shortcuts taken for the proof of concept may prove to be a breeding ground for much-hated bugs.

For trapping mis-implemented feature bugs, the EDA industry has provided a plethora of tools and methodologies, including code coverage (block, condition, toggle, FSM), functional coverage (simple, cross), assertions, formal verification, etc. The amount of automation leaves no gap for these kinds of bugs. You light a fire with these tools and these insects will commit suicide. It has been made that simple.
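
To make "functional coverage (simple, cross)" concrete, here is a toy model of a cross-coverage collector. It is a sketch in Python rather than in a hardware verification language, and the class name, the bins, and the sampled transactions are all hypothetical, chosen only for illustration.

```python
from itertools import product

class CrossCoverage:
    """Tracks which combinations of two sampled values have been observed."""

    def __init__(self, bins_a, bins_b):
        # Every pairing of a bin from A with a bin from B is a coverage goal.
        self.goal = set(product(bins_a, bins_b))
        self.hit = set()

    def sample(self, a, b):
        """Record one observed combination; values outside the bins are ignored."""
        if (a, b) in self.goal:
            self.hit.add((a, b))

    def percent(self):
        """Percentage of goal combinations observed so far."""
        return 100.0 * len(self.hit) / len(self.goal)

    def holes(self):
        """Combinations that no test has exercised yet."""
        return sorted(self.goal - self.hit)

# Hypothetical feature: transaction kind crossed with burst length.
cov = CrossCoverage(["read", "write"], [1, 4, 8])
for kind, burst in [("read", 1), ("read", 4), ("write", 8), ("read", 1)]:
    cov.sample(kind, burst)

print(f"cross coverage: {cov.percent():.1f}%")
print("holes:", cov.holes())
```

The list of holes is the useful part: it tells the team exactly which combinations still need a test. It can only report on the bins somebody thought to define, though, which is why the other categories of bugs need a different remedy.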

Mis-interpreted feature bugs can be caught only if the design and verification engineers arrive at their interpretations independently (orthogonally). Tests aimed at the feature would then bring the disagreement to the fore, to be discussed with a wider and wiser audience, and the bug would get caught. Any agreement between design and verification on an interpretation (read: mis-interpretation) makes a safe haven for some bugs. The situation is grimmer when the features are explained by the design team and not by the architects. Not all such bugs successfully make it to silicon, but they do manage to eat up precious time.

A video processing team started working on a new project. A module was to operate on the red component. The designer happily added Cr (chroma red) to its input list and implemented the functionality, which happened to be an algorithm developed by a software team sitting at the far end of the world. The verification engineers, without a clue about what they were verifying, did a perfect job of making sure that the design matched what the designer wrote in the document. But the module was supposed to operate on the red (R) component of RGB, not on Cr of YCbCr!
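
The R versus Cr mix-up can be sketched in a few lines. Everything here is an assumption made for illustration: the post does not say which YCbCr variant was involved, so the conversion uses the common full-range BT.601 (JFIF-style) constants, and gain() is a hypothetical stand-in for the software team's algorithm.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 (JFIF-style) conversion for 8-bit values (assumed variant)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def gain(x):
    """Hypothetical stand-in for the algorithm: add an offset and clamp."""
    return min(255, round(x) + 16)

def design(r, g, b):
    """What the designer built: the algorithm applied to Cr."""
    _, _, cr = rgb_to_ycbcr(r, g, b)
    return gain(cr)

def check_from_design_doc(r, g, b):
    """Checker derived from the designer's document, so it expects Cr too."""
    _, _, cr = rgb_to_ycbcr(r, g, b)
    return design(r, g, b) == gain(cr)

def check_from_architect_intent(r, g, b):
    """Checker derived independently from the intent: operate on R of RGB."""
    return design(r, g, b) == gain(r)

reddish = (100, 50, 50)
mid_gray = (128, 128, 128)

print("design-doc check, reddish pixel:", check_from_design_doc(*reddish))
print("intent check, reddish pixel:", check_from_architect_intent(*reddish))
print("intent check, mid-gray pixel:", check_from_architect_intent(*mid_gray))
```

The checker that shares the designer's interpretation can never fail, whatever the stimulus. The independently derived checker flags the reddish pixel, yet even it passes on mid-gray, where Cr and R both happen to equal 128, so the choice of stimulus matters as well.
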
In another project, a few days before tape-out, the architect asked for a test case that would use different service categories for the same type of customers, probably with different SLAs. What a bouncer! No one understood what it meant. Can we expect a bug-free outcome?

These are only two examples, and there could be many others, where a bug is spared or time is lost because of a communication gap.