Friday 12 February 2016

Is “Software Analysis as a Service” doomed?

Imagine that you have developed an interesting new testing technique, software analysis algorithm, or visualisation approach. You want to do what you can to encourage practitioners and organisations to adopt it. For example, you might attempt to secure some funding to begin a start-up and develop a self-sustaining company around the tool.

At this point, things become complicated. What business model will you choose? There is the traditional “client-side” approach: you produce a tool and either sell it to customers, or let them download it for free and charge for support.

Then there is the “server-side” approach [1]: you host the tool on a central server, and clients use it by uploading their data or by giving your tool access to their code repositories. This approach has a huge number of benefits. It is easier to track access to your tool; there is a single point at which to apply updates, fix vulnerabilities, add features and so on; you can collect accurate usage statistics, and can offer support in a more direct way. If you wish to set up some sort of pay-per-use arrangement, you do not need to wrestle with the web-linked license files that so many client-side tools still require.

It is surprising to me that such centralised tools are not more prevalent.

The one major factor seems to be privacy and security. For many organisations, their source code can in effect represent their entire intellectual property. Making this available to a third party represents an unacceptable risk, regardless of how useful an external tool might be. It is only in a few cases (e.g. Atlassian’s Bitbucket) that there seems to be a sufficient degree of trust to use server-side tools.

Does this mean that, if you have a tool, you must immediately go down the route of a client-side IDE plugin? It seems that, otherwise, you rule out an overwhelming proportion of your potential user-base.

The flip-side is that, as soon as you release a client-side tool, it becomes harder to deploy within a large organisation, and your business case has to be overwhelming (which is not the case for most academic tools).

To me this is a huge “barrier to impact” for academic software engineering tools. Are there workarounds?

One way to make the server-side option workable might be to add a “data-filtering” step that removes any sensitive information (e.g. identifiers) from the code but retains the information required for analysis, though I’m sceptical that this would be enough to satisfy potential customers.
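
To make the idea a little more concrete, here is a minimal sketch in Python (the anonymise function and its behaviour are entirely made up for illustration) of how identifiers could be replaced by opaque, consistently-hashed placeholders before any code leaves the client. It is only a toy: a real filter would need a proper parser for the target language, and would have to decide which names (library calls, string contents, comments) must survive for the analysis to remain meaningful.

    import hashlib
    import keyword
    import re

    def anonymise(source):
        """Replace identifiers with stable, opaque placeholders.

        A deliberately crude sketch: every word-like token that is not a
        Python keyword is treated as an identifier, so library names and
        even words inside string literals would be renamed too. The point
        is only to show that structure can be kept while names are stripped.
        """
        mapping = {}

        def rename(match):
            name = match.group(0)
            if keyword.iskeyword(name):
                return name
            if name not in mapping:
                digest = hashlib.sha256(name.encode()).hexdigest()[:8]
                mapping[name] = "id_" + digest
            return mapping[name]

        return re.sub(r"\b[A-Za-z_][A-Za-z0-9_]*\b", rename, source)

    print(anonymise("def compute_invoice_total(order):\n"
                    "    return sum(item.price for item in order.items)\n"))

The crucial property is that the name mapping never leaves the client, so the server only ever sees the structure of the code, not the vocabulary of the business domain.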

Are there any more reliable cryptographic approaches that would enable a client to use a server-side tool without the server being able to make use of the data? Could there be something akin to the “Zero-Knowledge Protocol” [2] that could be applied to server-side software analysis tools?
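
For readers who haven’t met the idea before, the following toy Python sketch (with deliberately tiny, insecure parameters) shows one round of a Schnorr-style zero-knowledge identification protocol: the prover demonstrates knowledge of a secret without ever transmitting it. Whether anything of this flavour could be stretched to cover a whole software analysis is, of course, precisely the open question.

    import secrets

    # One round of a toy Schnorr-style zero-knowledge proof of knowledge.
    # The prover convinces the verifier that it knows the exponent x behind
    # y = g^x mod p, without revealing x. The group is tiny and NOT secure;
    # this only illustrates the flavour of a zero-knowledge exchange.
    p, q, g = 23, 11, 2          # g = 2 has prime order q = 11 modulo p = 23

    x = secrets.randbelow(q)     # prover's secret
    y = pow(g, x, p)             # public value derived from the secret

    r = secrets.randbelow(q)     # prover picks an ephemeral nonce ...
    t = pow(g, r, p)             # ... and sends this commitment
    c = secrets.randbelow(q)     # verifier replies with a random challenge
    s = (r + c * x) % q          # prover's response

    # Verifier accepts iff g^s == t * y^c (mod p); x itself is never sent.
    assert pow(g, s, p) == (t * pow(y, c, p)) % p
    print("verifier accepts; the secret was never transmitted")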


[1] Ghezzi, Giacomo, and Harald C. Gall. "Towards Software Analysis as a Service." 23rd IEEE/ACM International Conference on Automated Software Engineering – Workshops (ASE Workshops 2008), 2008.

[2] Quisquater, Jean-Jacques, Louis C. Guillou, and Thomas A. Berson. "How to Explain Zero-Knowledge Protocols to Your Children." Advances in Cryptology – CRYPTO '89 Proceedings, vol. 435, pp. 628–631, 1990.

Tuesday 2 February 2016

Epistemology in Software Engineering and AI

Last week was a significant week for Computer Science. Engineers at Google demonstrated how their DeepMind AI engine was able to beat a European champion at Go - a major milestone that at the very least matches the significance of Deep Blue beating Kasparov in 1997. The week also saw the passing of AI pioneer Marvin Minsky.

I was struck by a blog-post by Gary Marcus that drew upon these two events. It touched upon the well-established rift between two opposed schools of thought within the field of AI. On the one hand you have the “Classical” school (espoused by Minsky), which argues that artificial intelligence must be a process of symbolic reasoning, underpinned by a logic and some formal framework within which to represent knowledge (Expert Systems, Inductive Logic Programming, etc.). On the other you have the school whose systems are not underpinned by logics or symbolic reasoning at all. Systems such as DeepMind are more “organic” in nature, relying entirely on the statistical analysis of training data (by way of mechanisms such as Neural Nets or Kernel Methods) to derive suitable models. The question of which school is right or wrong is something I’ll return to.

This split within AI is strongly reminiscent of a similar divide within my field of Software Engineering. On the one hand we have the “Formal Methods” school of thought, which argues that it is possible to develop a program that is “correct by construction” through the disciplined application of techniques founded upon logic and symbolic reasoning, such as program refinement and abstract interpretation, and that it is possible to establish whether or not an existing program is correct without having to execute it. On the other hand we have the “proof is in the pudding” school of thought, which ignores the construction of the program and focusses entirely on reasoning about its correctness by observing its input/output behaviour (i.e. testing).

In essence, both of these divides are concerned with the question of how one can establish knowledge. On the one hand you have the logicians, who argue that knowledge is a product of logic and reasoning. On the other hand you have the empiricists, who argue that knowledge depends upon observation.

I find these divisions interesting, because they are fundamentally philosophical — they are concerned with the very question of what knowledge is and how it can come to be, otherwise referred to as Epistemology. The rift between the logicians and the empiricists is a continuation of a long-standing argument. This is nicely characterised by Popper in his volume on “Conjectures and Refutations” as follows:

“… the old quarrel between the British and the Continental schools of philosophy — the quarrel between the classical empiricism of Bacon, Locke, Berkeley, Hume, and Mill, and the classical rationalism or intellectualism of Descartes, Spinoza, and Leibniz. In this quarrel the British school insisted that the ultimate source of all knowledge was observation, while the Continental school insisted that it was the intellectual intuition of clear and distinct ideas.”

I find this link between Computer Science and Philosophy interesting for several reasons. The very fact that such an old argument about how to establish “truth” is still alive and well, and that it is being played out in the arena of Computer Science, is exciting to me.

What is, however, just as interesting is the fact that this fundamental question of epistemology has been written about and debated for so long, and yet very little of this relevant epistemological history seems to filter through to papers within SE (and perhaps AI). For example, Dijkstra’s famous remark that testing can only establish the presence of bugs but not their absence merely restates an argument against the validity of inductive reasoning put forward by Hume in 1740: that one cannot assume “that those instances of which we have had no experience, resemble those, of which we have had experience” (apparently - I admit, I didn’t read it, and am quoting from Popper).

Popper’s essay is interesting because he argues that neither of these two world-views adequately explains our “sources of knowledge and ignorance”. Submitting to the authority of logic alone can lead to false conclusions, as can submitting to the authority of empirical observation (Popper suggests that submitting to one school of thought over another “breeds fanatics”). He argues that the process of establishing and reasoning about truth is necessarily a combination of the two; a process of questioning authority (whether the authority of a logic or an observed piece of evidence), of forming conjectures and then seeking to refute them. A process of valuing a fact not by the amount of evidence that supports it, but by the amount of effort that was made to reject and undermine it.

Of course this all sounds intuitive and perhaps even obvious. In AI this lesson does seem to have seeped through. Returning again to Gary Marcus’ blog, the Go-winning DeepMind system is not, as it might appear, a pure Neural Net construct; it is a hybridisation of Neural Nets and tree search (a “classical AI” technique).

In Software Engineering, however, the two tribes seem to remain quite separate (despite some notable efforts to combine them). I suspect that, when faced with the question of how to establish the correctness of a software system, a large proportion of academics would choose testing over formal methods, or vice versa. The two approaches are often couched as (competing) alternatives, instead of techniques that are mutually dependent if the objective is to maximise the correctness and safety of a software system.

References:

Popper, Karl. Conjectures and Refutations: The Growth of Scientific Knowledge. Routledge, 2014 (first published 1963).