Learning with errors


Large-language models threaten to upend software and hardware development as we know it, but can they really deliver the goods? Chris Edwards investigates.

The large-language model (LLM) has seen a remarkable rise from obscurity to, apparently, posing as big a threat to skilled jobs as electricity did to the Edwardian army of gas-lamp lighters. The neural-network layout behind Google’s Bidirectional Encoder Representations from Transformers (BERT) technology appeared just five years ago in a model that demanded more than 300 million parameters to capture training data of some three billion words taken from public-domain books and Wikipedia, far outstripping the parameter count of the ImageNet-winning convolutional neural networks that kickstarted the rebirth of AI.

Today, even the biggest implementation of the original BERT seems like a minnow compared to OpenAI’s variations on the core Transformer architecture. OpenAI has not disclosed how many parameters the current GPT-4 LLM stores, but it is widely reckoned to be substantially bigger than its predecessor, GPT-3, which used 175 billion. This model, and its competitors, could reshape how organisations perform software and hardware development, though the claims made for the abilities of LLMs may prove far more optimistic than the reality.

Next year, in addition to AI-driven tools that try to find ideal settings for existing layout tools, hardware designers will have a chance to try their hands on an LLM-derived tool aimed directly at them. Ansys is putting together a ChatGPT-powered agent designed to answer questions about the company’s products and wider engineering topics. However, software engineers have had access for several years to a quickly growing range of discrete and IDE-integrated tools based on LLMs that are intended to streamline code creation.

Microsoft-owned code-management specialist GitHub launched the original Copilot tool in June 2021, using an LLM derived from the earlier GPT-3. At that point, Copilot had a relatively simple remit: fill in blocks of code or act as a form of advanced autocomplete for fragments of code. GitHub says GPT-4 will form the basis of Copilot X, a new version that also encompasses test generation and documentation for raw code. The company says a million developers at more than 20,000 organisations worldwide had used the original Copilot assistant by June this year.

“I don’t know of anything that’s moved so fast in the developer realm,” Daniel Zingaro, associate professor of computer science at the University of Toronto, said in an online seminar in June organised by the Association for Computing Machinery (ACM). This rapid spread has worried many educators in computer science: generative AI potentially undermines many of the tests they set today in order to determine whether students understand the material. On the other hand, students would be right to be wary of chatbots’ answers.

A study by University of Auckland researcher Paul Denny and colleagues, based on courses that follow the ACM’s CS1 introductory curriculum, found that around half the time Copilot handed over code that contained errors or did not quite do the job requested. However, students do learn quickly.

A study presented at the ACM’s Special Interest Group on Computer-Human Interaction (SIGCHI) conference earlier this year showed how the group that had access to GitHub’s AI software devoted some of its time to testing the code the tool produced. Zingaro asks rhetorically, “When was the last time you saw students testing code?”

Similar considerations are likely to affect how engineers incorporate these tools into the workplace in both computer science and hardware engineering. And the outcomes may prove quite different from those suggested by the productivity studies that claim dramatic improvements.

Productivity boost?

A paper published in February this year on the ArXiv preprint site by Microsoft Research economics researcher Sida Peng, working with colleagues from GitHub and the Massachusetts Institute of Technology (MIT) Sloan School of Management, claimed that developers who used Copilot on the relatively simple task of writing a basic webserver application completed it on average 56 per cent faster than the control group.

A study published in June by management consultancy McKinsey, based on the experience of 40 of its own developers, found the potential for similar improvements. The time to write new code almost halved in some cases and even refactoring existing code sped up by around a third.

However, the improvement in productivity is not consistent: task complexity plays a big role.

In the McKinsey study on its pool of developers, time savings shrank to less than 10 per cent on tasks that developers deemed high in complexity or where they had little experience with the programming framework used for the task. Where unfamiliarity was high, the AI tools could become an outright hindrance: some tasks took junior developers up to 10 per cent longer with the tools than without them.

This last part mirrors the concerns of those in the hardware-design world. Ramesh Narayanaswamy, principal engineer at Synopsys, for example, described his exploration of ChatGPT-like tools for EDA tasks at the Verification Futures UK conference organised by Tessolve in June.

“If you’re a new developer and the tool gives you a slightly buggy UVM testbench, you need to be a UVM expert to know there’s a bug in there and be able to clean it up.” Not only does this impose an overhead, he warns, “A less-experienced verification engineer might accept it and check in some garbage.”

Aside from the problems of AI models making illogical connections, their focus on natural language may prove a limiting factor when dealing with large code bases. Narayanaswamy says token limits create further issues: each token processed by an LLM roughly equates to a meaningful word, and current models impose a limit of the order of a few thousand tokens per prompt.

“Maybe there are some with 8K tokens. If you try to ask about a Verilog module and say ‘summarize it’, having 8K tokens available isn’t much. You may have to chop it up in an intelligent way to handle the problem. You can’t throw a million lines of code at it and hope it will do something. It won’t. You need to do some clever stuff around it.”
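
The ‘chopping up’ Narayanaswamy describes can be as simple as splitting a file on line boundaries so that each piece fits the context window, summarising each piece and then summarising the summaries. The Python below is a rough sketch of that idea only, not any vendor’s implementation: the characters-per-token ratio is a rule of thumb and the ask_llm helper is a placeholder for whatever model the caller has access to.

```python
# Illustrative sketch only: split a large source file into chunks that fit
# a notional 8K-token context window, then summarise piecewise. The
# 4-characters-per-token ratio and ask_llm() are assumptions, not part of
# any real EDA or OpenAI tool.

TOKEN_LIMIT = 8_000            # assumed context window
CHARS_PER_TOKEN = 4            # rough rule of thumb for English-like text
CHUNK_CHARS = (TOKEN_LIMIT // 2) * CHARS_PER_TOKEN   # leave room for the reply


def chunk_source(text: str, chunk_chars: int = CHUNK_CHARS) -> list[str]:
    """Split text on line boundaries so each chunk stays under the budget."""
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        if size + len(line) > chunk_chars and current:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks


def summarise(text: str, ask_llm) -> str:
    """Summarise each chunk, then ask for a summary of the summaries."""
    partials = [ask_llm(f"Summarise this Verilog fragment:\n{chunk}")
                for chunk in chunk_source(text)]
    return ask_llm("Combine these partial summaries into one:\n" + "\n".join(partials))
```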

Extensive clever stuff is how some researchers are trying to improve the capabilities of existing LLMs. To see how much the models can do, researchers have applied chain-of-thought prompting. This tries to compensate for the limited logical reasoning LLMs perform in developing an answer by structuring the prompt so that the model works through a problem step by step rather than jumping straight to a conclusion. Such prompting improved GPT-3’s accuracy on arithmetic problems in the MultiArith dataset from less than 20 per cent to almost 80 per cent.
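
In its simplest form, the technique amounts to rewording the prompt. The fragment below is a minimal sketch of that idea rather than the researchers’ own code; the send_to_llm callable is a placeholder for whichever model is being tested.

```python
# Minimal sketch of chain-of-thought prompting for an arithmetic word
# problem. send_to_llm() is a placeholder, not a real vendor API.

question = ("A juggler has 16 balls. Half of the balls are golf balls, "
            "and half of the golf balls are blue. "
            "How many blue golf balls are there?")

# Plain prompt: ask for the answer directly.
plain_prompt = f"Q: {question}\nA:"

# Chain-of-thought prompt: one extra sentence nudges the model to reason
# step by step before committing to a number.
cot_prompt = f"Q: {question}\nA: Let's think step by step."


def compare(send_to_llm) -> tuple[str, str]:
    """Return the model's answers to the plain and chain-of-thought prompts."""
    return send_to_llm(plain_prompt), send_to_llm(cot_prompt)
```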

Some researchers see a future in organising prompts into libraries that make it easier for developers to obtain worthwhile results: in effect, building a high-level language that drives a pre-processor that delivers carefully worded prompts to the LLM. One issue may be that LLMs remain fast-moving targets. As researchers at the University of Maryland found earlier this year, many of the chain-of-thought sequences for arithmetic reasoning that worked for GPT-3 failed for ChatGPT.
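
A hypothetical sketch of such a library, with template names and wording invented purely for illustration, might amount to little more than a set of reusable templates that the pre-processor fills in before the text reaches the model.

```python
# Hypothetical prompt library: named templates a pre-processor expands
# before sending the result to an LLM. Names and wording are invented
# for illustration only.

PROMPT_LIBRARY = {
    "summarise_module": (
        "You are reviewing Verilog. Summarise what the following module "
        "does, listing its ports and any state machines:\n{code}"
    ),
    "explain_failure": (
        "Here is a simulation log excerpt. Identify the first error, its "
        "likely cause and the signals involved:\n{log}"
    ),
}


def build_prompt(name: str, **fields: str) -> str:
    """Expand a named template with the caller's fields."""
    return PROMPT_LIBRARY[name].format(**fields)


# Usage: prompt = build_prompt("summarise_module", code=verilog_source)
```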

In the meantime, a better strategy for engineers may be to ask LLMs to provide support for neglected parts of the development process rather than expect them to construct full code only to be forced to weed out the bugs and logical errors afterwards.

At Tessolve’s event, Matt Graham, project engineering group director in the system verification group at Cadence Design Systems, said, “Think about how many log files you have, how many waveforms or the other outputs of EDA tools. If you think about failure analysis, how often do you look at the log files of tests that pass? It’s not because we don’t think there’s good information. It’s just that we can’t [as humans] consume that much data. AI can potentially help us leverage that.”

Narayanaswamy agreed, “Nobody reads a log file but there’s probably one gem in there you would like to know about. The tools can help you.”
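
One plausible shape for that kind of help, sketched below purely as an illustration rather than as any Cadence or Synopsys feature, is to pre-filter a log so that only suspicious lines reach the LLM for summarising; the keyword list and the ask_llm helper are assumptions.

```python
# Illustrative sketch only: keep the lines of a regression log most likely
# to contain a "gem" and ask an LLM to summarise them. The keyword list
# and ask_llm() are assumptions, not features of any EDA tool.

import re

INTERESTING = re.compile(r"error|warning|timeout|assert|retry|mismatch", re.I)


def gems_from_log(path: str, ask_llm, max_lines: int = 200) -> str:
    """Filter a log file down to suspicious lines, then summarise them."""
    with open(path) as fh:
        hits = [line for line in fh if INTERESTING.search(line)]
    excerpt = "".join(hits[:max_lines])   # stay within the token budget
    return ask_llm(
        "This is a filtered excerpt from a passing regression log. "
        "Flag anything a verification engineer should look at:\n" + excerpt
    )
```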