ML strategies for automating code reviews

Code review is an important part of the software development life cycle in electronic design, and using an established coding standard - such as MISRA or CERT - is essential to ensuring the quality of the code being produced.

Coding standard guidelines capture many person-years of expertise, and their use greatly reduces the probability of introducing errors. Guidelines may be decidable, where the answer is a straight “yes” or “no”, or undecidable, where the answer may be “yes”, “no”, or “possibly”.

To automate code reviews, it is important that an industry-standardised static code analyser is used during development. However, even with the best static code analysers, there can be an overwhelming number of defects reported, especially at the start of a project or when a new module is added. These defects may be genuine (true positives) or may not be (false positives).

Sophisticated dataflow algorithms are often needed to perform the semantic analysis necessary to enforce undecidable rules in static code analysis. However, due to the complexity of these algorithms - along with the balance between performance and precision - false positives may be reported. The question then arises: should the algorithm run to completion, which may take several hours, or stop with a possible outcome that could be a false positive?

False positives are one of the main reasons developers give for not using static code analysis tools. Identifying and reviewing static analysis findings can be time-consuming, particularly when a finding is a violation of an undecidable guideline.

The introduction of machine learning into static code analysis can assist in detecting code structures that may lead to false positives. This makes analysis more accurate and reduces the time needed to review. When pairing machine learning with static analysis, there are several possible strategies to consider:

Grouping/Clustering

An initial strategy is to group defects that are similar in nature so that they can be addressed together. This requires a similarity measure to be defined and calculated. The measure could be based on the identity of the defect or on elements, such as functions or variables, found within the defect.
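As a minimal sketch of what such a measure might look like, the Python snippet below compares hypothetical defects, each described as a set of attributes, using Jaccard similarity. The attribute names and the choice of measure are assumptions for illustration; real analysers may use richer defect descriptions and different measures.

```python
# A minimal sketch of a defect similarity measure, assuming each defect
# is described by a set of attributes (hypothetical field names).

def jaccard_similarity(defect_a: set, defect_b: set) -> float:
    """Jaccard similarity: |intersection| / |union| of attribute sets."""
    if not defect_a and not defect_b:
        return 1.0
    return len(defect_a & defect_b) / len(defect_a | defect_b)

# Hypothetical defects described by rule ID and the elements involved.
d1 = {"rule:MISRA-C-9.1", "func:init_buffer", "var:buf"}
d2 = {"rule:MISRA-C-9.1", "func:init_buffer", "var:len"}
d3 = {"rule:CERT-ARR30-C", "func:copy_data", "var:dst"}

print(jaccard_similarity(d1, d2))  # high: same rule and same function
print(jaccard_similarity(d1, d3))  # zero: nothing in common
```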

Some static analysers assign a severity level to each defect, which makes it possible to group together defects of a similar nature at the same level. Defects could also be grouped by whether they are likely to produce false positives. However, the assignment of a defect to an appropriate level depends on the interpretation of the static code analyser's developer, so it may not guarantee a real relationship between every defect in a group.

This strategy can be enhanced by using an unsupervised machine learning clustering algorithm, such as “K-Means”. One of the simplest unsupervised learning algorithms, K-Means automatically groups defects: the idea is to define “K” centres, one for each cluster, and iteratively recompute each cluster's centre as defects are assigned to the nearest one. Two defects may be intuitively similar if, for example, they have the same source, sink, or control flow, because such defects take common paths. In other words, the more aspects two defects have in common, the more similar they will appear to the user, and the faster they will be to review.
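The sketch below illustrates the idea using scikit-learn's K-Means implementation. The feature encoding is an assumption made for illustration: each defect is represented by hypothetical source, sink, and control-flow attributes so that defects taking common paths land close together in feature space.

```python
# A minimal K-Means sketch using scikit-learn. The one-hot encoding of
# source, sink, and path attributes is hypothetical.
import numpy as np
from sklearn.cluster import KMeans

# Rows: defects; columns: [src=user_input, src=file, sink=memcpy,
# sink=printf, path=loop, path=branch] (assumed encoding).
features = np.array([
    [1, 0, 1, 0, 1, 0],   # defect A
    [1, 0, 1, 0, 1, 0],   # defect B: same source/sink/path as A
    [0, 1, 0, 1, 0, 1],   # defect C
    [0, 1, 0, 1, 1, 0],   # defect D: close to C
])

# Define K centres, one per cluster; K-Means iteratively recomputes
# each centre until the cluster assignments stabilise.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
print(kmeans.labels_)  # e.g. [0 0 1 1]: A/B and C/D reviewed together
```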

Ranking

This machine learning strategy determines the likelihood of defects being true positives and ranks them accordingly. Static analysis is based on both syntactic and semantic analysis of the code. The results from syntactic analysis are usually very reliable as the algorithms are decidable. However, syntactic analysis cannot identify complex defects and so semantic analysis is required.

As discussed previously, dataflow algorithms used with semantic analysis can identify complex defects, but they are more likely to lead to false positives because the results can be undecidable. These complex defects may involve approximations. (It should be noted that dataflow algorithms also employ other techniques that are less prone to false positives - for example, SMT (Satisfiability Modulo Theories) solvers such as Barcelogic and Yices.)

An example of an approximation would be multiplying a variable that is known to be within the range [0, 1000] by 2. It would be accurate to say that the result will be in the set of even numbers between 0 and 2000. However, it would be an approximation to say that it could be any number between 0 and 2000.
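The difference is easy to see in a few lines of Python, which contrast the precise result set with the interval approximation an analysis might keep instead:

```python
# The precise result set of "x * 2 for x in [0, 1000]" versus the
# interval approximation a dataflow analysis might track instead.

precise = {x * 2 for x in range(0, 1001)}   # even numbers 0..2000
approx = set(range(0, 2001))                # any number 0..2000

print(len(precise), len(approx))  # 1001 vs 2001 values
print(3 in precise, 3 in approx)  # False True: the approximation admits
                                  # values that can never actually occur,
                                  # which is where false positives arise
```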

Depending on the number of approximations involved, these complex defects can be ranked, with a lower number of approximations resulting in a higher ranking. If the defects are then addressed in rank order, fixing a simpler defect may also resolve a more complex one.
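A minimal sketch of this ranking, assuming each defect report carries a count of the approximations made during its analysis (a hypothetical field; as noted below, obtaining it requires internal knowledge of the tool):

```python
# Rank defects so that those analysed with fewer approximations, and
# therefore more likely to be true positives, are reviewed first.

defects = [
    {"id": "D-101", "approximations": 4},
    {"id": "D-102", "approximations": 0},   # exact result: review first
    {"id": "D-103", "approximations": 2},
]

ranked = sorted(defects, key=lambda d: d["approximations"])
for rank, defect in enumerate(ranked, start=1):
    print(rank, defect["id"])
```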

The limitation of this strategy is that to determine the number of approximations, it is necessary to have internal knowledge of the algorithms employed by the specific analysis tool.

AI-Assisted Ranking

A more general approach to ranking, one that does not depend on internal knowledge of the tool, is to prioritise defects that are likely to be true positives because they are “similar” to defects that were reported in the past and confirmed by a user to be real problems.

In the same way, if a defect “looks like” one that was previously reported and shown not to be an issue, the ranking system can deprioritise it as a likely false positive.

This machine learning-based approach establishes similarity and assigns each defect to one of two groups: “True Positive Reports” (TP) or “False Positive Reports” (FP). These groups are based on the results of previous human reviews of defects.

Defects can be accurately mapped to the appropriate group and prioritised for further review by using “supervised machine learning algorithms”, a family of machine learning algorithms that build their statistical models from sets of labelled examples. As more data becomes available, these models continually improve.

For example, an algorithm based on the Support Vector Machine (SVM) model maps the labelled examples into an N-dimensional feature space and attempts to separate them with a hypersurface, so that examples with different labels lie on different sides of it. The model is solved for a hypersurface of a predefined shape by maximising the margin: the distance between the hypersurface and the examples closest to it.
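As an illustration only, the sketch below trains scikit-learn's SVM on hypothetical defect features labelled with past review verdicts, then ranks new defects by their signed distance from the separating hypersurface. The feature encoding and the data are assumptions, not the output of any particular tool.

```python
# A minimal supervised-learning sketch: an SVM trained on previously
# reviewed defect reports (hypothetical features and labels).
import numpy as np
from sklearn.svm import SVC

# Each row encodes a past defect report, e.g. [approximation count,
# tainted source present, located in dead code] (assumed encoding).
X_train = np.array([
    [3, 1, 0], [0, 1, 0], [5, 0, 1],
    [1, 1, 0], [4, 0, 0], [0, 0, 0],
])
y_train = np.array([0, 1, 0, 1, 0, 1])  # human verdicts: 1=TP, 0=FP

# Fit the separating hypersurface; an RBF kernel allows a curved shape.
model = SVC(kernel="rbf").fit(X_train, y_train)

# Rank new defects: larger signed distance on the TP side of the
# hypersurface means a higher review priority.
X_new = np.array([[0, 1, 0], [4, 0, 1]])
scores = model.decision_function(X_new)
for defect, score in sorted(zip(["D-201", "D-202"], scores.tolist()),
                            key=lambda t: t[1], reverse=True):
    print(defect, round(score, 2))
```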

Application of this algorithm has shown a high probability of accurate mapping. This indicates that supervised learning can be a useful tool for prioritising defects.

Conclusion

Using a combination of the strategies described, machine learning can improve the results of static analysis tools when used to post-process analysis reports. Applying clustering and ranking, either separately or together, will reduce the effort required by users to review defects and thus increase confidence in, and adoption of, static analysis in the development environment.

Author details: Jill Britton is Director of Compliance at Perforce Software