Saturday, 16 September 2017

The Problem of Productivity in Software Engineering Research

Software engineering research has a productivity problem. Many researchers across the world are engaged in software engineering research, but the path from idea to publication is often fraught. As a consequence, there is a danger that many important ideas and results are not receiving the attention they deserve within academia, or finding their way to the practitioners whom the research is ultimately intended to benefit.

One of the biggest barriers faced by software engineering researchers is (perhaps ironically) the need to produce software. Research is overwhelmingly concerned with the development of automated techniques to support activities such as testing, remodularisation and comprehension. It is rightly expected that, in order to publish such a technique at a respectable venue, the proposed approach has to be accompanied by some empirical data, generated with the help of a proof-of-concept tool.

Developing such a tool requires a lot of time and effort. This effort can be roughly spread across two dimensions:
(1) the ‘scientific’ challenge of identifying and applying suitable algorithms and data-types to fit the problem, and running experiments to gather data, and
(2) the ‘engineering’ challenge of ensuring that the software is portable and usable, can be applied in an ‘industrial’ setting, scales to arbitrarily large systems, and can be used by a broad range of users.

Whereas the first dimension can often be accomplished within a relatively short time-frame (a couple of person-months perhaps), the second dimension (taking an academic tool and scaling it up) can rapidly become enormously time-consuming. In practice, doing so will often only realistically be possible in a well-resourced and funded lab, where the researcher is accompanied by one or more long-term post-doctoral research assistants.

This is problematic because the second dimension is often what matters when it comes to publication. An academic tool that is widely applicable can be used to generate larger volumes of empirical data, from a broader range of subject systems. Even if the underlying technique is not particularly novel or risky, the fact that it is accompanied by a large volume of empirical data renders it immediately more publishable than a technique that, whilst more novel and interesting, does not have a tool that is broadly applicable or scalable, and thus does not have the same volume of empirical data. I previously discussed this specific problem in the context of software testing research.

Indeed, the out-of-the-box performance of the software tool (as accomplished by dimension 2) is often used to assess at face value the performance of the technique it seeks to implement (regardless of whether or not the tool was merely intended as a proof of concept). One of the many examples of this mindset shines through in the ASE 2015 paper on AndroTest, where a selection of academic tools (often underpinned by non-trivial heuristics and algorithms) were compared against the industrial, conceptually much simpler MonkeyTest random testing tool. Perhaps embarrassingly for the conceptually more advanced academic tools, MonkeyTest was shown to be the hands-down winner in terms of performance across the board. I am personally uneasy about this sort of comparison, because it is difficult to delineate to what extent the (under-)performance of the academic tools was simply due to a lack of investment in the ‘engineering’ dimension. Had they been more usable and portable, with less dependence upon manual selection of parameters etc., would the outcome have been different?

This emphasis on the engineering dimension is perhaps one of the factors that contributes to what Moshe Vardi recently called the “divination by program committee”. He argues that papers are often treated as “guilty until proven innocent”, and the maturity and industrial applicability of an associated tool can, for many reviewers, become a factor in deciding whether a paper (and its tool) should make the cut.

In my view, this is the cause of a huge productivity problem in software engineering. The capacity to generate genuinely widely usable tools that can produce large volumes of empirical data is rare. Efforts to publish novel techniques based on proof-of-concept implementations geared towards smaller-scale, specific case studies often fail to reach the top venues and fail to make the impact they perhaps should.


In his blog, Moshe Vardi suggests that reviewers and PC members should perhaps shift their attitude towards one of “innocent until proven guilty”. In my view, this more lenient approach should include a shift away from the overarching emphasis on empirical data and generalisability (which implies the need for highly engineered tools).