Friday 13 May 2016

In defence of the Siemens Suite: There is nothing wrong with evaluating testing techniques on "small" programs.


Software testing research has an (in my opinion somewhat perverse) obsession with "scalability". When producing empirical results, techniques that have been applied to larger systems are favoured over techniques that have "only" been applied at a unit level. I know this because many reviewers of my own papers (where the empirical results tend to be produced at a unit level) have said as much.

Take this example from a paper (that was happily accepted despite this comment) from last year:

"The case studies are small (and include the infamous and deprecated Seimans suite... the authors might want to work on **MUCH** bigger programs for future versions of this work."

The reviewer is correct in one respect; the Siemens suite (of seven small-ish C programs) is widely derided. I remember it being contemptuously referred to as "the seven dwarves".

I believe, however, that this derision of small programs is irrational. To support this, I put forward three arguments, elaborated below.

First, let us set aside the specifics of what is meant by "size". A "small" program can be an executable program with <200 LOC and a simple interface, contained within a single module (e.g. TCAS). A "big" program can be a large executable consisting of >10,000 LOC spread across multiple modules, with a complex interface (e.g. Mozilla or JEdit).

For the sake of terminology, we'll take a "unit" to mean a single executable piece of functionality (e.g. a public method in a class).

Argument 1: To be useful in industry, a testing technique or tool does not necessarily have to scale up to "large" units or systems.

Complex software systems are commonly decomposed into relatively small units of functionality. Even if some units do happen to be too large for a given technique to test, it might still be capable of handling the majority of smaller units. If this is the case, then it would save time and resources, and would surely be useful.

The argument against the Siemens suite is not necessarily just about the size of the units, but also that they are old -- much of the code originates from the mid-nineties. But then again, so does the code in many industrial systems, which commonly contain many legacy components that are much older (and smaller) than those in the Siemens suite.

Argument 2: Real bugs are often contained within small units of code that can be run in a self-contained manner.

If one looks at some of the "bugs" that have hit the headlines in recent years, it is often apparent that they could have been localised to a specific unit of code.

Take, for example, the Apple "goto fail" bug and Heartbleed:

https://nakedsecurity.sophos.com/2014/02/24/anatomy-of-a-goto-fail-apples-ssl-bug-explained-plus-an-unofficial-patch/

http://martinfowler.com/articles/testing-culture.html#heartbleed

Both are contained within functions of a complexity comparable to TCAS. In fact, Martin Fowler (in the above link) makes this case -- they could both have been detected by unit testing.
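
For illustration, here is a heavily simplified C sketch of the "goto fail" control-flow mistake, together with the kind of tiny unit test that would expose it. The function and check names are stand-ins of my own, not Apple's actual code; only the control-flow error is reproduced.

    /* A simplified sketch of the "goto fail" pattern. The checks are
       stand-ins: each returns 0 on success and non-zero on failure. */
    #include <stdio.h>

    static int check_hash(int input)      { (void)input; return 0; }
    static int check_signature(int input) { return input == 42 ? 0 : 1; }

    /* Returns 0 if verification succeeds, non-zero otherwise. */
    static int verify(int input)
    {
        int err;
        if ((err = check_hash(input)) != 0)
            goto fail;
            goto fail;  /* BUG: duplicated line. It is not guarded by the
                           "if", so it always executes, and the signature
                           check below becomes unreachable dead code. */
        if ((err = check_signature(input)) != 0)
            goto fail;
    fail:
        return err;     /* err still holds 0 from check_hash, so inputs
                           with bad signatures "verify" successfully. */
    }

    /* A tiny unit test of the kind argued for above. */
    int main(void)
    {
        if (verify(7) == 0)
            printf("FAIL: bad signature accepted\n");
        else
            printf("PASS: bad signature rejected\n");
        return 0;
    }

Running this test immediately reports the failure; no large system context is needed to catch the bug.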

In other words, automated tools and techniques that scale only to "small" programs can readily have a huge impact. This brings to mind the formal-methods techniques that have found bugs in the Java implementation of the TimSort algorithm -- a reasonably small piece of code, with a bug that ended up affecting billions of Android devices and PCs.

http://www.envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/

So, if "much bigger" programs are required, how much bigger should they be? What criterion would they have to meet? And why?

Argument 3: A "small" system can be enormously complex

The critical piece of functionality might be contained within 5-10 lines of code, but any one of those lines might call a library function. Consider, for example, a Java method that uses collections and calls Math utilities, IO libraries, etc. A relatively small program can therefore be functionally very complex, because a substantial amount of its functionality is delegated.
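
The same point can be made with a minimal sketch in C (the function below is hypothetical, invented purely for illustration): barely a dozen lines, yet almost every line delegates substantial functionality to the standard library.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* Sorts the values and writes them, plus their root mean square,
       to a file. Returns 0 on success, non-zero on I/O failure. */
    int summarise(double *values, size_t n, const char *path)
    {
        qsort(values, n, sizeof *values, cmp_double); /* delegates sorting */
        double sum_sq = 0.0;
        for (size_t i = 0; i < n; i++)
            sum_sq += values[i] * values[i];
        double rms = sqrt(sum_sq / n);                /* delegates to libm */
        FILE *f = fopen(path, "w");                   /* delegates all I/O */
        if (f == NULL)
            return 1;
        for (size_t i = 0; i < n; i++)
            fprintf(f, "%f\n", values[i]);
        fprintf(f, "rms=%f\n", rms);
        return fclose(f);
    }

    int main(void)
    {
        double xs[] = { 3.0, 1.0, 2.0 };
        return summarise(xs, 3, "summary.txt");
    }

Testing summarise thoroughly means exercising the behaviour of qsort, sqrt and the stdio functions through it, so even this "small" function carries the complexity of everything it delegates to.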

Even if libraries do not come into play, it is still possible to have compact programs that, within a couple of hundred lines of code, include complex data manipulations and control structures. I refer again to the above example of the TimSort algorithm -- implemented again and again, widely studied, relatively small, but still buggy.

As soon as a piece of code becomes difficult to verify rapidly by manual inspection (which happens very quickly), an automated tool or technique becomes very useful. It does not take hundreds of lines of code to reach that point; I would argue that the threshold at which a tool becomes useful is much lower.

Conclusions


This emphasis on scale has a pernicious effect on research. It focusses efforts on techniques that can feasibly apply to large programs, at the expense of (potentially much more powerful) techniques that can be applied to smaller programs. As a consequence, many really fundamental testing problems -- ones that have not even been addressed for "small" programs -- remain untouched.

This attitude also drives an (in my view artificial) wedge between the formal-methods and software-engineering communities. Important techniques that have proven enormously valuable in practice, such as model checking and theorem proving, are often dismissed because they "don't scale", and only receive consideration in a software engineering forum if they are framed in such a way that they might be adapted to "real-life", "big" programs.

However, this attitude is again misplaced, for exactly the reasons listed above. These techniques are useful, and are accepted as such in practice. Formal-methods events such as FM have an industrial presence that can probably match that of most large software engineering events. I can think of many "unscalable" formal-methods techniques that have been successfully spun off or absorbed into the working practices of large organisations such as Facebook and Microsoft.

For many tasks, there are no convincing state-of-the-art techniques that even work on small programs. So getting a technique to work on a small program is important, and is in and of itself potentially useful to industry.

Let's not dismiss the Siemens suite.