Friday 16 December 2016

Virtuous Loops or Tyrannical Cycles?

Feedback loops - the ability to continually refine and improve something in response to feedback - are seemingly innocuous, pervasive phenomena. For this blog post I’m particularly interested in instances where these loops have been deliberately built into a particular technology or solution to solve a specific problem. I’m writing this post because I want to put forward the following argument:

The most successful, ‘disruptive’ changes in society and technology have been (and are being) primarily brought about by innovations that are able to harness the power of this simple loop.

Potted History

Throughout the 20th century, this simple loop has dominated technological progress. We will look at some examples below. It is worth highlighting that these are presented in rough chronological order, but (and I don’t believe this is an accident) they are also ordered according to the rate at which feedback is provided; one of the reasons I am writing this blog entry is that this rate appears to be constantly increasing.

In the early 1900s, Bell Labs were experiencing a great deal of variability in the quality of their telephone wiring. Walter Shewhart, a statistician, was charged with devising a means of eliminating this variability. In order to refine the cabling process, he proposed a cycle - later popularised by W. Edwards Deming - that became known as the Plan-Do-Check-Act cycle. In essence, any procedure is devised as follows: you “Plan” your procedure, you “Do” it, you “Check” the quality of the outcome, identifying any problems, and you “Act” by refining the plan before repeating the cycle.

The approach was one of the major innovations that formed the basis for the American manufacturing boom in the early 20th century. In the aftermath of the Second World War, Deming and his colleagues carried this ethos of continuous feedback to Japan. It set the foundations for the Toyota Production System and similar manufacturing procedures that underpinned the Japanese “economic miracle”, and has since inspired various modern equivalents, such as “lean manufacturing” and Continuous Process Improvement.

These principles eventually fed into the most enduring approaches within software engineering. Iterative and Incremental Development and its successors among Agile techniques all play centrally upon feedback: regular cycles enable continual feedback from the client about the product, daily team meetings enable feedback from developers, and technologies such as GitHub increasingly enable feedback on individual code changes.

Within web applications, continuous feedback from users to providers (or to each other) has become a core means of maintaining and enforcing service standards - think of the likes of Uber, Airbnb, and eBay.

At an algorithmic level, most Machine Learning algorithms play upon some form of internal feedback loop. Two obvious examples are Genetic Algorithms and “Deep Learning” Recurrent Neural Networks. At their heart is a capability to continually modulate an inferred model by adapting to feedback. The last couple of years have seen enormous advances in Machine Learning technology. Computers can now beat humans at Go; they can recognise details in enormously complex patterns; they can drive cars. I do not think that this is because of enormous advances in the algorithms themselves. It is because there has been a sudden surge in the availability of data, and specifically in the availability of feedback.
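To make the shape of that loop concrete, here is a minimal sketch (in Python, with toy numbers of my own invention) of the cycle at the heart of most learning algorithms: make a prediction, measure the error - the feedback - and nudge the model accordingly.

    # A minimal sketch of the feedback loop inside most learning algorithms:
    # predict, measure the error (the feedback), adjust the model, repeat.
    def fit_line(points, learning_rate=0.01, cycles=2000):
        """Fit y = w*x + b to (x, y) pairs by repeatedly correcting against feedback."""
        w, b = 0.0, 0.0
        for _ in range(cycles):
            for x, y in points:
                error = (w * x + b) - y         # feedback: how wrong is the current model?
                w -= learning_rate * error * x  # act on the feedback by nudging the model
                b -= learning_rate * error      # towards the observed data
        return w, b

    print(fit_line([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]))  # converges towards w=2, b=1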

What next?

What links the emergence of powerful Machine Learning algorithms with the sudden rise of apps such as Uber? 

I believe that both have become as powerful as they are because it has been easier to collect data that can be channeled into feedback. This is as true for Machine Learning as it is for apps, where there are suddenly millions of users, all of whom have smartphones with virtually uninterrupted internet access.

The potential that could be gained from exploiting this loop first became apparent to me when I read a 2004 Nature paper, “The Robot Scientist”, which nicely embodied the feedback loop: a Machine Learner took the place of a human scientist, carrying out experiments (automatically, via robotic equipment) to infer a model of a particular genetic pathway. In their loop, the hypothesised model was refined after each experiment and tested in each subsequent cycle. At the time this appeared impossibly futuristic to me. We are now almost 13 years down the line, and fully automated drug discovery has been realised.
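For the curious, here is a hypothetical, heavily simplified sketch of that kind of closed loop (nothing like the actual Robot Scientist machinery): candidate hypotheses are stand-in Python functions, experiments are inputs, and on each cycle the simulated lab result is fed back to eliminate the hypotheses it refutes.

    def closed_loop(hypotheses, experiments, run_experiment):
        """Keep only the hypotheses whose predictions survive each round of feedback."""
        surviving = dict(hypotheses)                      # name -> prediction function
        for experiment in experiments:
            if len(surviving) <= 1:
                break                                     # model inferred (or everything refuted)
            outcome = run_experiment(experiment)          # the automated "Do" step
            surviving = {name: h for name, h in surviving.items()
                         if h(experiment) == outcome}     # refine the model using the feedback
        return list(surviving)

    # Toy usage: which (invented) growth rule explains the simulated assay results?
    hypotheses = {"doubling": lambda dose: dose * 2, "constant": lambda dose: 4}
    hidden_pathway = lambda dose: dose * 2                # stands in for the real experiment
    print(closed_loop(hypotheses, experiments=[1, 2, 3], run_experiment=hidden_pathway))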

To me, this renders the coming years as exciting as they are terrifying. For every noble cause, such as automated drug discovery, there is an equally disconcerting one, especially when humans become a direct part of this loop. An unrelenting, unwavering, mechanised channel of feedback is necessarily reductive. If the feedback has direct implications for someone’s livelihood, the consequences can be brutal (cf. the consequences for Uber drivers who receive consistently low ratings).

As organisations are seduced by the apparent “direct democracy” and self-regulating properties of these mechanisms, it is easy to see how, for its workers, this can turn into tyranny at the hands of a capricious, remote customer.


Even in my field of university teaching, similar changes are becoming increasingly tangible. Students are now customers, and their feedback on teaching (however subjective that may be) plays at least as much of a role as research in ranking departments and universities against each other, a trend that will no doubt become more pronounced with the emergence of the Teaching Excellence Framework. I noticed this year that the Panopto lecture-recording software, mandated across a growing list of UK universities, even has a feature that gives students the option (in a Netflix-esque way) to rate individual lectures out of five stars…

Monday 10 October 2016

Autonomous Vehicles in the UK: Beware the Robotised Drunk Driver



[Disclaimer: I have had no involvement with the development of autonomous vehicle software. Any concerns raised here are based on my generic software development knowledge, and might (hopefully) be completely unfounded with respect to autonomous vehicles in the UK.]

Much has been written about the interesting ethical questions that arise with the control systems in autonomous vehicles. Most of this has been a rephrasing of the various age-old gedanken experiments that could confront an autonomous AI. For example: the car is driving at speed and a pedestrian steps out in front of it - the AI can either take evasive action into a lane of oncoming traffic (posing a risk to the driver and oncoming vehicles), or it can plough on into the pedestrian and kill them. What action should / can it take, and who is at fault when someone dies?

Fundamentally, this elicits either a technical response (the car would employ some technology to prevent such a situation from ever arising) or a legalistic one (the driver should always have their hands hovering over the wheel, so the responsibility is ultimately theirs).

However, in my view there is another ethical question that cannot be shrugged off so lightly. Previous thought exercises have assumed that the software is behaving correctly according to the rules set out by its developers. The software could, however, easily contain bugs. One can imagine a pathological case where a bug leads to a car veering off course, yet also prevents any intervention from the driver.

If such bugs are possible, or even probable, is it ethical to expose drivers and the wider UK public to them?

Software systems within cars are enormously complex. As an illustrative example, well before the era of autonomous vehicles, the software system that controlled just the brake and the throttle of a 2005 Toyota Camry amounted to 295 thousand non-commented lines of C code. Move on a decade, and the software that controls a modern autonomous car is orders of magnitude more complex.

Highly complex software systems become difficult to manage and understand. They become especially difficult to test rigorously, and practically impossible to verify. This is especially true if they involve a great deal of concurrency, non-determinism, and Machine Learning, and rely upon complex sensor inputs (as is invariably the case with autonomous vehicles).

The “pathological” example of a software bug causing a car to veer off course, beyond the control of the driver, is perhaps not as pathological as it seems. This is what happened with the Toyota Camry mentioned above (along with a range of Toyota and Lexus models up to 2010). Even though the software was merely in charge of brake and throttle control, it led to circumstances where the brake became unresponsive and the driver was unable to slow down, and it has been linked to “at least 89” deaths in the US. Subsequent inspection of the software showed it to be incomprehensible, untestable, and a probable hotbed of bugs. Since then, we have witnessed most major car manufacturers regularly recalling hundreds of thousands of cars due to software defects.

There is also no sign that this trend is about to abate in the case of autonomous vehicles. An autonomous vehicle with software bugs is akin to a drunk driver in robot form.
We have already witnessed collisions, and even fatalities, caused by bugs in autonomous vehicle software. Google's autonomous car manoeuvred itself into a bus. There have been several reports of Tesla crashes - a Tesla that "autopiloted" into an SUV, killing its passengers; an auto-piloted Tesla that crashed into a bus full of tourists; another that auto-piloted into a truck, killing its driver.

This is not necessarily due to poor practice by Google or Tesla. It is simply a brutal reflection of the fact that software is inevitably defect-prone. This appears to be especially the case with Machine-Learning-oriented autonomous car software.

This leads to an ethical conundrum, not just for car manufacturers, but for governments who choose to offer themselves up as a testbed for this technology. By providing “permissive regulations” for these vehicles to be trialled in cities across the UK, the government is unwittingly exposing the British public to robots that are not under the control of their drivers and that are governed by software which almost inevitably contains bugs.

It is important to emphasise this: it is not just the drivers themselves, but pedestrians, cyclists, and families with children crossing roads, who are being exposed. In academia, if this were an experiment and the general public were the subjects, it would not come close to passing an ethics committee.

My personal view is that our processes for quality assurance are not yet mature enough to provide adequate confidence in the correctness of such software systems.

I am completely unfamiliar with the QA processes that are mandated by the UK government in this instance. But, if the genie is to be released from the bottle, one has to hope that the government has at the very least:

  1. Established quality and trust models that specifically factor in the properties of autonomous software that make it so particularly difficult to reason about.
  2. Mandated that the software artefacts in autonomous vehicles are inspected and verified by an independent third party, cognisant of the specific UK driving conditions that might not have been factored into QA activities abroad, and that the reports from these verification activities are made openly available to the public.
  3. Begun compiling a large body of test scenarios and associated data to be applied to every autonomous vehicle driven in the UK.
  4. Maintained a database of all autonomous vehicles in the UK, so that if (when) dangerous software defects are detected, the affected vehicles can be mandatorily recalled before they cause harm.




Friday 2 September 2016

Blurring Boundaries in Computer Science


Research groups in Computer Science departments are awkward. I’m not talking about group members themselves, but the very notion of a group. 

Whereas traditional subjects such as Medicine have lots of nice, well-defined groupings (oncology, diabetes, etc.), this is far less clear for Computer Science, where new disciplines are continuously arising and the boundaries between established areas seem to be gradually dissolving.

Let us take a set of typical department groups:

Formal Methods

Algorithms

Software Engineering

Machine Learning

Natural Language Processing

These areas all have their own journals and conference venues, their own superstars and core problems. 

But it is striking how much overlap there can be. Let us start with Software Engineering. There are now countless Software Engineering papers that apply and build upon Machine Learning techniques. Natural Language Processing has a second home in Software Engineering, with the extensive use of Topic Modelling, LSI, etc. (Topic Modelling, by the way, is itself a technique that straddles Machine Learning and Natural Language Processing.) Formal Methods and Software Engineering essentially share the same goals. Theorem proving, a core Formal Method, increasingly uses Machine Learning techniques; indeed, many of the automated reasoning techniques that underpin theorem provers fundamentally share the same goals as Machine Learning algorithms. And all of these areas (FM, SE, ML, and NLP) revolve around Algorithms.

Crucially, the expertise of individuals rarely sits squarely in one of these areas, but usually cuts across two or more of them. The traditional groups are simply no longer appropriate for pigeon-holing many researchers or research problems.

This is what makes Computer Science groups (and the discipline as a whole) awkward. The community (and departments) are very much split along these traditional, entrenched lines. Formal methods people operate in their own groups and publish in their own conferences and journals. As do Algorithms people, Software Engineers, etc. 

This post, however, argues that these boundaries are unnecessary and largely artificial. This always hits home when I get the opportunity to attend a conference that is slightly outside my area. It is always striking how much scope there is for cross-disciplinary collaboration, and how similar the fundamental problems are. One can’t help but wonder how much work is duplicated across different fields, and how much further along we would be as a discipline if these boundaries simply didn’t exist.


Friday 13 May 2016

In defence of the Siemens Suite: There is nothing wrong with evaluating testing techniques on "small" programs.


Software testing research has an (in my opinion somewhat perverse) obsession with "scalability". When producing empirical results, techniques that have been applied to larger systems are favoured over techniques that have "only" been applied at the unit level. I know this because many reviewers of my own papers (where empirical results tend to be produced at the unit level) have said as much.

Take this comment from a review of one of my papers (which was happily accepted despite it) last year:

"The case studies are small (and include the infamous and deprecated Seimans suite... the authors might want to work on **MUCH** bigger programs for future versions of this work."

The reviewer is correct in one respect; the Siemens suite (of seven small-ish C programs) is widely derided. I remember it being contemptuously referred to as "the seven dwarves".

I believe, however, that this derision of small programs is irrational. To support this I put forward three arguments. These are elaborated below.

First, let us not get bogged down in the specifics of what is meant by "size". Roughly, a "small" program can be an executable with <200 LOC and a simple interface, contained within a single module (e.g. TCAS). A "big" program can be a large executable consisting of >10,000 LOC spread across multiple modules, with a complex interface (e.g. Mozilla, JEdit, etc.).

For the sake of terminology, we'll take a "unit" to mean a single executable piece of functionality (e.g. a public method in a class).

Argument 1: To be useful in the industry, a testing technique or tool does not necessarily have to scale up to "large" units or systems.

Complex software systems are commonly decomposed into relatively small units of functionality. Even if some units do happen to be too large for a given technique to test, it might still be capable of handling the majority of smaller units. If this is the case, then it would save time and resources, and would surely be useful.

The argument against the Siemens suite is not necessarily just about the size of the units, but also that they are old - much of the code originates from the mid-nineties. But then again, so does the code in many industrial systems, which commonly contain legacy components that are much older (and smaller) than those in the Siemens suite.

Argument 2: Real bugs are often contained within small units of code that can be run in a self-contained manner.

If one looks at some of the "bugs" that hit the headlines in recent years, one can often surmise that these could have been localised to a specific unit of code.

Look at the Apple Goto-Fail bug and Heartbleed, etc.:

https://nakedsecurity.sophos.com/2014/02/24/anatomy-of-a-goto-fail-apples-ssl-bug-explained-plus-an-unofficial-patch/

http://martinfowler.com/articles/testing-culture.html#heartbleed

These are both contained within functions of a complexity comparable to TCAS. In fact, Martin Fowler (in the above link) makes exactly this case - both could have been detected by unit testing.
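To illustrate the point with an invented stand-in (not Apple's actual code), the goto-fail pattern boils down to a small function that silently skips a check it was supposed to perform - and a single unit test over that one function is enough to expose it:

    # Hypothetical stand-in for a goto-fail-style bug: a small, self-contained
    # verification function that silently skips one of its checks.
    def signature_ok(hash_matches: bool, issuer_trusted: bool) -> bool:
        if not hash_matches:
            return False
        return True   # BUG: issuer_trusted is never consulted (the "goto fail; goto fail;" effect)

    # A plain unit test over the single function exposes the bug immediately.
    def test_untrusted_issuer_is_rejected():
        assert signature_ok(hash_matches=True, issuer_trusted=False) is False

    if __name__ == "__main__":
        test_untrusted_issuer_is_rejected()   # raises AssertionError, flagging the missing check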

In other words, automated tools and techniques that scale "only" to small programs can readily have a huge impact. This brings to mind the formal-methods techniques that found a bug in the Java implementation of the TimSort algorithm - a reasonably small piece of code, with a bug that ended up affecting billions of Android devices and PCs.

http://www.envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/

So, if "much bigger" programs are required, how much bigger should they be? What criteria would they have to meet? And why?

Argument 3: A "small" system can be enormously complex

The critical piece of functionality might be contained within 5-10 lines of code. But one of these lines might call a library function. One can consider as an example a Java method that uses collections, calls Math utilities, IO libraries etc. So a relatively small program could be functionally very complex because a substantial amount of its functionality is delegated.

Even if libraries do not come into play, it is still possible to have compact programs that, within a couple of hundred lines of code, include complex data manipulations and control structures. I refer again to the example of the TimSort algorithm - implemented again and again, widely studied, relatively small, and still buggy.

As soon as a piece of code becomes difficult to rapidly verify by manual inspection (which is very soon), an automated tool or technique becomes very useful. Hundreds of lines of code are not necessary for this, and I would argue that the threshold for a useful tool is much lower.

Conclusions


This emphasis on scale has a pernicious effect on research. It focusses effort on techniques that can feasibly be applied to large programs, at the expense of (potentially much more powerful) techniques that can only be applied to smaller programs. As a consequence, many really fundamental testing problems remain untouched - they have not even been addressed with respect to "small" programs.

This attitude also drives an (in my view artificial) wedge between the formal methods and software engineering communities. Techniques that have proven enormously valuable in practice, such as model checking and theorem proving, are often dismissed because they "don't scale", and only get consideration in a Software Engineering forum if they are framed in such a way that they might be adapted to "real-life", "big" programs.

However, this attitude is again misplaced, for exactly the reasons listed above. These techniques are useful, and are accepted as such in practice. Formal methods events such as FM have an industrial presence that can probably match that of most large software engineering events. I can think of many "unscalable" formal methods techniques that have been successfully spun off or absorbed into the working practices of large organisations such as Facebook and Microsoft.

For many tasks, there are no convincing state-of-the-art techniques that even work on small programs. So getting a technique to work on a small program is important, and is in and of itself potentially useful to the industry.

Let's not dismiss the Siemens suite.

Friday 18 March 2016

Are Mining & Machine Learning bad for "risky" Software Engineering Research?

Machine Learning and Data Mining are playing an ever greater role in Software Engineering. To an extent this is natural; one of the big Software Engineering challenges is to find ways of dealing with the vast amount of information in source code, version repositories, Stack Overflow, etc., and ML/DM algorithms often provide a neat solution.

To an extent, this rapid rise has been prompted by (a) the preponderance of tools that can easily extract mineable data from software (and its repositories, etc.), coupled with (b) the rise of accessible, customisable ML frameworks such as WEKA, which can be used off the shelf.

However, I believe there is another factor at play, which is (in my opinion) symptomatic of a potential problem. The quality of an SE paper is often assessed in terms of the scale of the systems to which it is applied (yes - other factors come into play, but these are often treated as secondary factors). To be counted as a "mature" research product, the realism and scale of the evaluation artefacts are key.

Obtaining data for large, real systems is often relatively straightforward (cf. the PROMISE repository, etc.). Given that ML implementations abound, there are few significant technical hurdles to applying advanced ML algorithms to large volumes of data from real systems. If the novelty of a paper lies in the choice of an existing, implemented ML algorithm to address an established SE problem, it is relatively straightforward to produce a proof of concept and a convincing publication.
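As a (hedged) illustration of just how low the technical barrier is, the sketch below applies an off-the-shelf scikit-learn classifier to a PROMISE-style defect dataset; the file name and column names are hypothetical placeholders.

    # Off-the-shelf defect prediction: one CSV of per-module metrics plus a label,
    # one stock learner, one cross-validated score. No bespoke tooling required.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    data = pd.read_csv("promise_style_metrics.csv")   # hypothetical: loc, complexity, churn, ...
    features = data.drop(columns=["defective"])
    labels = data["defective"]

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_val_score(model, features, labels, cv=10, scoring="f1")
    print(f"10-fold cross-validated F1: {scores.mean():.2f}")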

However, this comes at a potential cost to other research. What if the data in question are harder to obtain, and the collection requires some advanced tooling - e.g. program analysis? What if the data analysis algorithm in question is not available off-the-shelf, and needs to be designed and developed? What if the technique involves a lot of human input, which hampers data collection? I worry that, regardless of their technical contribution, these techniques can be overlooked in favour of techniques for which it is easier to obtain large amounts of data.

Although understandable in one sense, this carries a danger: "risky" research is disincentivised if it requires tricky tooling or data that is difficult to obtain, while research on problems for which data collection is easy and tooling is readily available is incentivised. This not only risks turning SE into some form of applied data science (which I think does it a disservice), but also risks driving research away from some of SE's most interesting unsolved problems.

Perhaps the implicit charge is that SE research can at times be evaluated in terms that are reminiscent of "Technology Readiness Levels" in the industry. Research techniques at a high TRL are more publishable than research at a low TRL. By implication, research that uses established toolsets (of which there are many in ML), gets a head-start. This could come at a cost to "riskier" research, for which there are fewer established tool/data-sets available.



Friday 12 February 2016

Is “Software Analysis as a Service” doomed?

Imagine that you have developed an interesting new testing technique, software analysis algorithm, or visualisation approach. You want to do what you can to encourage practitioners and organisations to adopt it. For example, you might attempt to secure some funding to begin a start-up, to develop a self-sustaining company around the tool.

At this point, things become complicated. What business model will you choose? There is the traditional “client-side” approach; you produce a tool, you sell it to customers, or let customers download it for free and offer support for a fee. 

Then there is the “server-side” approach [1]: you host the tool on a central server, and clients use it by uploading their data or giving your tool access to their code repositories. This approach has a huge number of benefits. It is easier to track access to your tool; there is a single point at which to apply updates, fix vulnerabilities, and add features; you can collect accurate usage statistics and offer support in a more direct way. And if you wish to come up with some sort of pay-per-use arrangement, you do not need to bother with messy web-linked licence files, as so many client-side tools still require.

It has always surprised me that such centralised tools are not more prevalent.

The one major factor seems to be privacy and security. For many organisations, their source code can in effect represent their entire intellectual property. Making this available to a third party represents an unacceptable risk, regardless of how useful an external tool might be. It is only in a few cases (e.g. Atlassian’s Bitbucket) that there seems to be a sufficient degree of trust to use server-side tools.

Does this mean that, if you have a tool, you must immediately go down the route of a client-side IDE plugin? Otherwise, you seemingly rule out an overwhelming proportion of your user base.

The flip-side is that, as soon as you release a client-side tool, it becomes harder to deploy within a large organisation, and your business case has to be overwhelming (which is not the case for most academic tools).

To me this is a huge “barrier to impact” for academic software engineering tools. Are there workarounds?

One way to operate the server-side option would perhaps be to have a “data-filtering” step that removes any sensitive information (e.g. identifiers) from the code but maintains the information required for analysis, though I’m sceptical that this would be enough to satisfy potential customers.
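As a rough sketch of what such a filtering step might look like (assuming the client’s code is Python, and Python 3.9+ for ast.unparse; a real filter would also need to leave library and builtin names intact), identifiers can be mapped to opaque placeholders before the code ever leaves the client:

    import ast
    import itertools

    class IdentifierScrubber(ast.NodeTransformer):
        """Replace user-chosen names with opaque placeholders, preserving code structure."""

        def __init__(self):
            self._mapping = {}
            self._counter = itertools.count()

        def _anonymise(self, name):
            if name not in self._mapping:
                self._mapping[name] = f"id_{next(self._counter)}"
            return self._mapping[name]

        def visit_FunctionDef(self, node):
            node.name = self._anonymise(node.name)
            self.generic_visit(node)
            return node

        def visit_arg(self, node):
            node.arg = self._anonymise(node.arg)
            return node

        def visit_Name(self, node):
            node.id = self._anonymise(node.id)
            return node

    def scrub(source: str) -> str:
        """Return a structurally identical program with sensitive identifiers removed."""
        return ast.unparse(IdentifierScrubber().visit(ast.parse(source)))

    print(scrub("def price_forecast(sales_q3):\n    return sales_q3 * 1.2"))
    # -> def id_0(id_1):
    #        return id_1 * 1.2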

Are there any more reliable cryptographic approaches that would enable a client to use a server-side tool without the server being able to make use of the data? Could something akin to a “Zero-Knowledge Protocol” [2] be applied to server-side software analysis tools?


[1] Ghezzi, Giacomo, and Harald C. Gall. "Towards Software Analysis as a Service". 23rd IEEE/ACM International Conference on Automated Software Engineering - Workshops (ASE Workshops 2008), 2008.

[2] Quisquater, Jean-Jacques, Louis C. Guillou, and Thomas A. Berson. "How to Explain Zero-Knowledge Protocols to Your Children". Advances in Cryptology – CRYPTO '89, LNCS 435, 1990, pp. 628–631.

Tuesday 2 February 2016

Epistemology in Software Engineering and AI

Last week was a significant week for Computer Science. Engineers at Google DeepMind demonstrated how their AlphaGo engine was able to beat a European champion at Go - a major milestone that at the very least matches the significance of Deep Blue’s defeat of Kasparov in 1997. The week also saw the passing of AI pioneer Marvin Minsky.

I was struck by a blog post by Gary Marcus that drew upon these two events. It touched upon the well-established rift between two opposed schools of thought within the field of AI. On the one hand you have the “Classical” school (espoused by Minsky), who argue that artificial intelligence must be a process of symbolic reasoning, underpinned by a logic and some formal framework within which to represent knowledge, such as Expert Systems, Inductive Logic Programming, etc. On the other you have the school that builds systems which are not underpinned by logics or symbolic reasoning; systems such as AlphaGo are more “organic” in nature, relying on the statistical analysis of training data (by way of mechanisms such as Neural Nets or Kernel Methods) to derive suitable models. The question of which school is right or wrong is something I’ll return to.

This split within AI is strongly reminiscent of a similar divide within my own field of Software Engineering. On the one hand we have the “Formal Methods” school of thought, which argues that it is possible to develop a program that is “correct by construction” through the disciplined application of techniques founded upon logic and symbolic reasoning, such as program refinement and abstract interpretation, and that it is possible to establish whether or not an existing program is correct without having to execute it. On the other hand we have the “proof is in the pudding” school of thought, which ignores the construction of the program and focusses entirely on reasoning about its correctness by observing its input/output behaviour (i.e. testing).

In essence, both of these divides are concerned with the question of how one can establish knowledge. On the one hand you have the logicians, who argue that knowledge is a product of logic and reasoning. On the other hand you have the empiricists, who argue that knowledge depends upon observation.

I find these divisions interesting, because they are fundamentally philosophical — they are concerned with the very question of what knowledge is and how it can come to be, otherwise referred to as Epistemology. The rift between the logicians and the empiricists is a continuation of a long-standing argument. This is nicely characterised by Popper in his volume on “Conjectures and Refutations” as follows:

“… the old quarrel between the British and the Continental schools of philosophy — the quarrel between the classical empiricism of Bacon, Locke, Berkeley, Hume, and Mill, and the classical rationalism or intellectualism of Descartes, Spinoza, and Leibniz. In this quarrel the British school insisted that the ultimate source of all knowledge was observation, while the Continental school insisted that it was the intellectual intuition of clear and distinct ideas.”

I find this link between Computer Science and Philosophy interesting for several reasons. The very fact that such an old argument about how to establish “truth” is still alive and well, and is being played out in the arena of Computer Science, is exciting to me.

What is just as interesting, however, is that this fundamental question of epistemology has been written about and debated for so long, and yet very little of this relevant epistemological history seems to filter through to papers within SE (and perhaps AI). For example, Dijkstra’s famous remark that testing can only establish the presence of bugs but not their absence merely re-states an argument against the validity of inductive reasoning first put forward by Hume (that one cannot assume “that those instances of which we have had no experience, resemble those, of which we have had experience”) in 1740 (apparently - I admit I didn’t read it, and am quoting from Popper).

Popper’s essay is interesting because he argues that neither of these two world-views adequately explains our “sources of knowledge and ignorance”. Submitting to the authority of logic alone can lead to false conclusions, as can submitting to the authority of empirical observation (Popper suggests that submitting to one school of thought over another “breeds fanatics”). He argues that the process of establishing and reasoning about truth is necessarily a combination of the two; a process of questioning authority (whether the authority of a logic or an observed piece of evidence), of forming conjectures and then seeking to refute them. A process of valuing a fact not by the amount of evidence that supports it, but by the amount of effort that was made to reject and undermine it.

Of course this all sounds intuitive and perhaps even obvious. In AI, this lesson does seem to have seeped through. Returning again to Gary Marcus’ blog: the Go-winning AlphaGo system is not, as it might appear, a pure Neural Net construct, but a hybrid of Neural Nets and tree search (a “classical AI” technique).

In Software Engineering, however, the two tribes seem to remain quite separate (despite some notable efforts to combine them). I suspect that, when faced with the question of how to establish the correctness of a software system, a large proportion of academics would choose testing over FM or vice versa. The two approaches are often couched as (competing) alternatives, instead of techniques that are mutually dependent if the objective is to maximise the correctness and safety of a software system. 

References:

Popper, Karl. Conjectures and Refutations: The Growth of Scientific Knowledge. Routledge, 2014 (first published 1963).