Friday 18 March 2016

Are Mining & Machine Learning bad for "risky" Software Engineering Research?

Machine Learning and Data Mining are playing an ever greater role in Software Engineering. To an extent this is natural; one of the big Software Engineering challenges is to find ways of dealing with the vast amount of information in source code, version repositories, Stack Overflow, and the like, and ML/DM algorithms often provide a neat solution.

To an extent, this rapid rise has been prompted by (a) the proliferation of tools that can extract mine-able data from software, repositories, and related artefacts, coupled with (b) the rise of accessible, customisable ML frameworks such as WEKA, which can be used off-the-shelf.

However, I believe there is another factor at play, one that is symptomatic of a potential problem. The quality of an SE paper is often assessed in terms of the scale of the systems to which its technique is applied (yes - other criteria come into play, but these are often treated as secondary). To be counted as a "mature" research product, the realism and scale of the evaluation artefacts are key.

Obtaining data for large, real systems is often relatively straightforward (cf. the PROMISE repository). Given that ML implementations abound, there are few significant technical hurdles to applying advanced ML algorithms to large volumes of data from real systems. If the novelty of a paper lies in the choice of an existing, implemented ML algorithm to address an established SE problem, it is comparatively easy to produce a proof of concept and a convincing publication.
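To illustrate just how low the barrier can be, here is a minimal sketch using WEKA's Java API to run a ten-fold cross-validated decision tree over a PROMISE-style defect dataset. The filename jm1.arff is a placeholder for whichever ARFF file is to hand; the choice of J48 is arbitrary.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.trees.J48;
import weka.classifiers.Evaluation;
import java.util.Random;

public class DefectPredictionDemo {
    public static void main(String[] args) throws Exception {
        // Load a PROMISE-style dataset (placeholder filename).
        DataSource source = new DataSource("jm1.arff");
        Instances data = source.getDataSet();
        // Treat the last attribute as the class label (defective / not defective).
        data.setClassIndex(data.numAttributes() - 1);

        // Off-the-shelf C4.5 decision tree, evaluated with 10-fold cross-validation.
        J48 tree = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```

The point is not that this particular model is any good, but that the entire pipeline - from data to evaluated result - fits in a dozen lines, with no bespoke tooling required.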

However, this comes at a potential cost to other research. What if the data in question are harder to obtain, and their collection requires advanced tooling - e.g. program analysis? What if the analysis algorithm in question is not available off-the-shelf, and needs to be designed and developed? What if the technique involves a lot of human input, which hampers data collection? I worry that, regardless of their technical contribution, such techniques can be overlooked in favour of those for which it is easier to obtain large amounts of data.

Although understandable in one sense, this dynamic carries a danger: "risky" research is disincentivised if it requires tricky tooling, or data that is difficult to obtain. Conversely, research on problems for which data collection is easy and tooling is readily available is incentivised. This not only risks turning SE into a form of applied data science (which I think does the field a disservice), but also risks driving research away from some of SE's most interesting unsolved problems.

Perhaps the implicit charge is that SE research can at times be evaluated in terms reminiscent of "Technology Readiness Levels" in industry. Research at a high TRL is more publishable than research at a low TRL. By implication, research that uses established toolsets (of which there are many in ML) gets a head-start. This could come at a cost to "riskier" research, for which there are fewer established tools and data sets available.