Squeezing Neurons into Narrow Spaces: AI in QA

Today, AI based on neural networks is at a very interesting stage of its development. It has clearly taken off: we see numerous applications from reading CT scans to picking fruits. But adoption rates vary a lot. Recommendation engines, customer support bots, and other stuff that’s been called “internet AI” are fairly widespread; we see it everywhere in our lives. However, most areas, including software testing, aren’t quite there yet.

What makes AI widely adopted and what’s tripping it up? Let’s take a look at our own field, software testing. Specifically, I want to see how well neural networks can generate automated tests, and what barriers there are to mass adoption of AI in our field. I’ll be looking first and foremost at the problems and difficulties not because I’m trying to be glass-half-empty, but because things that go wrong always tell us more about how a system works than things that go right.

Once we’re done generating tests, I want to look at the applications of AI in some other fields, like medicine — there’s a lot of work being done there, especially with reading scans. Maybe they have similar growing pains? And finally, for a bird’s-eye view, I’ll look at a completely different industry that has already matured and solidified. What can its experience tell us about the evolution of AI today?

With this perspective, perhaps we can stop the hype/disillusionment pendulum that plagues new technologies, and maybe get a glimpse of where AI is going.

Acknowledgments

Before we go on, I’d like to thank Natalia Poliakova and Artem Eroshenko for the time and expertise that they’ve lent me with this article. However, the responsibility for everything written lies squarely with me.

Also, a big thank you to Danil Churakov for the awesome pictures in this (and other) posts!

Generating Tests: Where It’s at Now

Today, ChatGPT, GitHub Copilot, and other AI-driven tools are being used quite widely to write code and tests, and we’ve recently had a very interesting interview about it. However, people are still mostly writing automated tests by hand, and the usage of neural networks is mostly situational. The rate at which they hit the mark in this field is still well below the human norm. A late 2020 paper managed to predict 62% of assertion statements in tests, and that was definitely a brag-worthy result. According to GitHub Copilot’s FAQ, its suggestions are accepted 26% of the time. People often say that its best use is as a quicker alternative to Google and StackOverflow. So — situational usage. For now, AI in software QA can’t boast the same level of performance as in, e.g., marketing, where digital marketing and data analysis strategies created by neural networks are actually superior to those created by humans. However, keep in mind that things might change with GPT-4.

Where do these issues stem from? Let’s ask around and do some hands-on research. I’ve talked to my experienced tester colleagues and also tinkered with Machinet, an extremely popular plugin for IDEA that generates unit tests.

The neural networks we’ve worked with are very useful for their intended applications. Machinet was likely created as a developer assistant rather than an all-purpose tool. Their motto definitely sounds like it’s aimed at devs rather than QA: “Let’s face it. Unit testing is a chore. Just automate it.” This makes sense — it’s devs who usually write the unit tests. And for a person without deep knowledge of testing, the tool does an excellent job. It’s only when we’re looking at the results from a QA perspective that the faults become visible. So what issues were there?

The Issues

The network at work

Excessive Coverage

The Machinet plugin does its job, and does so thoroughly – it makes sure to cover if-else branches, uses different data types, checks character limits, etc. As a matter of fact, it does the job too thoroughly and can generate too many tests, with no regard for resource load. For instance, checking that a field accepts any character and checking that it restricts string length to 255 characters can be easily done inside a single test — but Machinet generated a separate test for each case. Another example involved a data retrieval function that was used by several API methods. The generated tests would cover the function in every one of the methods, and a separate test would be created for the function itself. Again, we’re clearly getting excessive coverage.
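To make the contrast concrete, here is a minimal JUnit 5 sketch of how a human might fold both checks into one test; the validator class and its name are hypothetical, not taken from any real project.

```java
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

// Hypothetical validator class, used purely for illustration.
class NameFieldTest {

    private final NameFieldValidator validator = new NameFieldValidator();

    // One test exercises both constraints: arbitrary characters are accepted,
    // and the 255-character limit is enforced.
    @Test
    void acceptsAnyCharactersUpToTheLengthLimit() {
        assertTrue(validator.isValid("Łukasz O'Brien-Smith 山田 #42"));
        assertTrue(validator.isValid("x".repeat(255)));
        assertFalse(validator.isValid("x".repeat(256)));
    }
}
```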

The impression was that if we had several intersecting sets of data, the plugin would try to cover all possible intersections. Say you’re testing an e-commerce website, and you’ve got multiple options for delivery (customer pick-up, payment type, etc.). With Machinet, all possible combinations will probably be tested, whereas a human tester could use a decision table to cover the combinations that really matter. However, doing this would probably require contextual data that isn’t fed into the neural network.
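As a rough illustration of the decision-table approach, here is what it might look like as a JUnit 5 parameterized test: a handful of rows covering the combinations that matter rather than the full cartesian product. CheckoutService and the rules encoded in the rows are made up for the example.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;

// Hypothetical checkout service; each row is one line of the decision table.
class CheckoutDecisionTableTest {

    @ParameterizedTest
    @CsvSource({
        // delivery, payment, expected
        "COURIER, CARD, ACCEPTED",
        "PICKUP,  CASH, ACCEPTED",
        "COURIER, CASH, REJECTED", // made-up rule: no cash on courier delivery
        "PICKUP,  CARD, ACCEPTED"
    })
    void orderIsAcceptedOrRejectedPerDecisionTable(String delivery, String payment, String expected) {
        assertEquals(expected, new CheckoutService().placeOrder(delivery, payment));
    }
}
```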

Code Repetition

A related issue is code repetition. We were testing a collection that needed to be populated with objects, and it had to be used in 4 different tests. Each test re-created the collection from scratch, instead of moving it into a separate field or a function. Moreover, a function for generating the collection already existed in the code, and it did essentially the same thing that was done in the tests, but the plugin didn’t use it. How well could a neural network use something that local? It would be tempting to say, well, this is a unique situation that the network has never encountered before. But it’s not like a neural network has any trouble recognizing a unique photo of a cat that it has never seen before. This is an interesting problem that we’ll return to below.
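For comparison, here is a minimal sketch of the human fix: build the shared fixture once in a setup method (or call the factory that already exists in the project) and reuse it across the tests. All names here (Order, OrderTestData, OrderProcessor) are hypothetical.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

import java.util.List;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

class OrderProcessingTest {

    private List<Order> orders;

    @BeforeEach
    void setUp() {
        // The collection is built in one place, by the factory the project
        // already has, instead of being re-created inside every test.
        orders = OrderTestData.sampleOrders();
    }

    @Test
    void noOrderIsLostDuringProcessing() {
        assertEquals(orders.size(), new OrderProcessor().process(orders).size());
    }

    @Test
    void processedOrdersKeepTheirIdentity() {
        assertTrue(new OrderProcessor().process(orders).containsAll(orders));
    }
}
```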

Understanding Complex Code

The automated tests generated by neural networks are at their best when they cover simple, modular pieces of code. Of course, it would be awesome if all code was written that way, but that might not be possible or feasible. And when you need to cover a long and complex piece of business logic, the reliability of machine-generated tests suffers a lot. Sure, they will be generated, but it will be difficult to figure out if they cover things properly or if it’s just random stuff that can barely be counted on for smoke testing.

Understanding Context

I’ve touched upon this when speaking about excessive coverage: the network doesn’t really know what the method is used for, and so can’t optimize based on that knowledge. For instance, a human tester knows that if we’re testing an API method in Java, we should check for null values; if it isn’t an API method, this can be skipped. The networks we’ve seen don’t seem to make decisions like that.
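For instance, the null check that only makes sense for an externally exposed method might look like the sketch below; UserApi and its expected behavior are assumptions made for the illustration.

```java
import static org.junit.jupiter.api.Assertions.assertThrows;

import org.junit.jupiter.api.Test;

// Hypothetical API class: a human adds this test because the method is
// exposed to external callers; an internal helper would not necessarily need it.
class UserApiNullHandlingTest {

    @Test
    void exposedEndpointRejectsNullInput() {
        UserApi api = new UserApi();
        assertThrows(IllegalArgumentException.class, () -> api.findUser(null));
    }
}
```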

Usage of External Tools and Patterns

We’ve also noticed that Machinet populated the collection with hard-coded values, instead of using a library for generating data randomly. I don’t know if this is too much of an ask, but there is a related problem: the neural networks we’ve worked with don’t do well with external dependencies, and they struggle with patterns, too.
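What we expected instead is something along these lines: test data produced by a data-generation library such as Java Faker (assuming it is available to the project); the Customer class is hypothetical.

```java
import com.github.javafaker.Faker;

import java.util.ArrayList;
import java.util.List;

// Sketch: populate the test collection with generated values rather than
// hard-coded literals. Customer is a made-up domain class.
class CustomerTestData {

    private static final Faker FAKER = new Faker();

    static List<Customer> randomCustomers(int count) {
        List<Customer> customers = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            customers.add(new Customer(
                    FAKER.name().fullName(),
                    FAKER.internet().emailAddress(),
                    FAKER.number().numberBetween(18, 90)));
        }
        return customers;
    }
}
```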

Cleanup

Finally, the code generated by the networks definitely needs to be thoroughly reviewed. Of course, all code written by humans needs to be reviewed too; this is standard practice, and the creators of AI-based tools always warn about it. This definitely has to be kept in mind when developing such tools: there will be lots of back-and-forth, and making editing easy should be a prime consideration. But there is another reason why this might be a problem. As I’ve said, Machinet seems to be aimed at developers who might not be too familiar with testing, so the person reviewing the machine’s work likely won’t be able to correct all the flaws I’ve just talked about.

The Reasons Behind the Issues

So, the tools we’ve been working with struggle when deep knowledge of the local project is needed, they don’t “understand” the context of the methods they are testing, and they aren’t good at optimizing code in terms of repetition or resource usage. They work best on code that is simple and modular, and the code they create tends to be that way too.

At first glance, these might seem like fairly deep issues, but maybe that’s because we’re talking about “understanding” instead of talking about data. It is no surprise that the network isn’t able to optimize code based on resource consumption if it isn’t trained on data for RAM and CPU usage. ChatGPT has no knowledge of your particular project because it has not been trained on it. In the end, these hurdles aren’t intrinsic to code; it’s just a question of economics: whether or not it’s feasible to train a network on a particular set of data.

Is the Deeper Reason Code Itself?

However, there is a deeper issue here: code is different from human language. That problem has been well researched, and it’s been tripping up algorithms developed for natural language for some time now. If we compare a natural language to a tool that you use mostly “as-is,” then a programming language is a tool that you use to build tools to build tools to build tools. It’s much more nested.

Best regards to XKCD

Every new function or variable that a programmer writes gets its own name, and these names can get really long and unintelligible — Java in particular is notorious for VariableNamesThatLookAndFeelLikeTrainWrecks.

This means that, e.g., the source code for Ubuntu can have two orders of magnitude more unique “words” (well, tokens) than a large corpus of texts in English. This is a very large and very sparse vocabulary, and both properties are bad news for any algorithm. Rare words mean there is less data to learn from, and vocabulary size affects the speed and memory requirements of the algorithms.

These nested vocabularies are also very localized. Most words that you use inside your project will never see the outside. That’s the whole point of encapsulation: the fewer things stick out, the easier the code is to use. As a result, you’ve got application- and even developer-specific vocabularies. In addition, the tooling people use changes constantly, and that breeds new vocabularies too. And aside from that, there is style: the code we write can be pretty opinionated. Was that thing you just wrote Pythonic? Or are you still dragging around your old Java habits? The same problem can be solved in different ways, creating further confusion.

So imagine your tech stack. You’ve got your language (maybe several languages), your framework, your database, all the libraries that you have at the front end and at the back end, as well as the patterns that you use, and the idiosyncrasies of the developers. All of these intersecting contexts severely limit the amount of similar code for any particular AI-driven tool. It’s no wonder, then, that such tools have trouble generating complex code.

The Issue Is Deep, but Not Deep Enough

Sure, this is a problem, but it’s far from insurmountable. The thing is, within the context defined by language, framework, project, etc., code is actually much more repetitive than human language (also, see this). And that repetitive part is the dominating one. One study found that a large collection of JavaScript files contained 2.4 million unique identifiers, but only 10k of those were responsible for 90% of all occurrences.

So it’s not some deep issue with the data (our code), it’s a problem of how the data is organized and how it’s fed into a tool. Maybe it can be solved by using an algorithm that trains on sub-words, not words. Such an algorithm wouldn’t have to deal with million-word-long vocabularies, but it would be more resource-taxing. Or maybe it’s about having a model that can be re-estimated rapidly enough so that it can adapt to your local context. Or perhaps we should make contextual information like comments available to the algorithm. One way or another, the answer is always about how we organize the data and how quickly it can be processed.
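To give a feel for the sub-word idea, here is a toy sketch that splits Java-style identifiers into common pieces; real models learn their sub-word units (byte-pair encoding and the like) rather than relying on a hand-written rule, so treat this only as an intuition pump.

```java
import java.util.Arrays;
import java.util.List;

class IdentifierSubwordSplitter {

    // Break identifiers on underscores and lowercase-to-uppercase transitions,
    // so rare names decompose into frequent, learnable pieces.
    static List<String> split(String identifier) {
        String spaced = identifier
                .replace('_', ' ')
                .replaceAll("([a-z0-9])([A-Z])", "$1 $2");
        return Arrays.asList(spaced.toLowerCase().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(split("VariableNamesThatLookAndFeelLikeTrainWrecks"));
        // prints: [variable, names, that, look, and, feel, like, train, wrecks]
    }
}
```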

Feeding the network

Zoom Out: Evolution of AI

Let’s try to put those issues into perspective, and see how they compare to difficulties that AI has had to overcome in other fields.

One way or another, the history of neural networks has revolved around processing more data in less time. This is what I didn’t quite get at first: with neural networks, quantity is quality; the more data, the more complex and accurate the answers.

People have been trying to build neural networks almost as long as there have been computers, but for most of that time, AI was dominated by a different school, the “rule-based” approach: tell your system the rules and watch it work; if it fails, ask an expert and make new rules. In contrast, neural networks figure out the rules themselves based on the data they process. So why did the networks only “take off” in the past decade or two? Because that’s when we got large-scale data storage (cloud in particular), insane volumes of generated data (through IoT, among other things), and huge processing power (esp. GPU computing); and that’s when several innovations happened that greatly improved the efficiency of training deep neural networks, from Geoffrey Hinton’s work in the mid-2000s to the invention of transformers in 2017. Long story short, it became possible to process lots of data quickly.

But just having the algorithms and hardware might not be enough. Healthcare is an area that gathers and stores large volumes of data, is always looking for cutting-edge tech, and attracts plenty of investment. So it’s no surprise that in some places and in certain fields (like ophthalmology), AI has already found its way into daily usage. However, there is still a long way to go before widespread adoption. In healthcare, unlike programming, there do exist models that outperform humans, but they mostly remain in development and keep failing at the implementation stage.

The reason is that real-life data is “dirty” compared to the sets used to train the models. Currently, data in hospitals is gathered and stored in different ways; a hospital usually has several systems that only talk to each other through PDF files. And, of course, the data infrastructure differs from hospital to hospital, so you can’t build any large-scale datasets. Also, the patients can be of different demographics than those used to train the models. In addition, some signals that practitioners use are not digitized, which lowers the comparative efficiency of the models. And there are issues of patient confidentiality when training models on local data sets.

Getting into the Narrow Spaces

Generally speaking, there are two ways around these obstacles: make the models fit the narrow spaces that real-life usage presents us with — or widen the spaces.

The large datasets that are used to train the top-performing models in healthcare are created through steps that can’t be reproduced in real life for proprietary reasons. So models could be calibrated on local environments to improve their accuracy, while synthetic datasets can be used to preserve confidentiality.

But none of this is likely to be very effective without changing the infrastructure used to collect and store data. The systems in place have to be able to interface with each other, data should be stored off-site, as proposed in the STRIDES initiative, and most importantly, it has to become more uniform.

The reasons behind the problems of adoption in code and healthcare are, unsurprisingly, very different. The solutions are less so. We can expect the models to become cheaper and more effective at being re-trained to fit local contexts. And we can change the way we organize the data. To make the networks working with code and tests more efficient, we need more standardized ways of writing and storing them. Many of the practices for what we consider “clean code” are aimed at making code more readable for humans. Maybe what we need now are ways of making it more readable for neural networks.

Or maybe it’s even simpler. If the neural networks can generate simple, repetitive code, and are good at covering it with tests — then perhaps we let them do both sides of the job and forget about clean coding practices altogether, instead concentrating on ways of controlling their output.

Summary

When a new technology comes along, there are two paths for adoption: first, the tech tries to fit some roles already existing in the economy; then, the economy changes to accommodate the new tech full-scale. The second path means that a web of infrastructure is created to support the tech. If you’ve got cars, you’re also getting mass construction of new roads, you get changes to legislation, etc. If you want electric cars, you need chargers.

Those are the two paths open before AI in QA right now that can help marry the tech to the infrastructure needed to support it:

1. Making AI-driven tools that are better able to fit the local contexts where they will be applied, tools that are able to learn on a particular project and tech stack

2. Accommodating the way tests are organized to the requirements of neural networks



