The repo for this project can be found here.
Professional developers spend around 70% of their time debugging. Of the roughly 70 buggy lines that developers write for every 1,000 lines of code, about 15 make it into production. Compilers and interpreters have been designed with this in mind and commonly offer error-fixing tips, but they often fail to provide insightful clues on how to fix the underlying bug.
Neural networks have emerged as a candidate for bug patching. In this work, we use transformer models to translate buggy one-line Java statements into bug-free ones.
We mined about 10,000 Java GitHub repositories and extracted one-line changes between the pre-commit and post-commit versions of files. Since these changes occur between successive commits, we reason that they represent logical bugs in addition to syntax or code-style errors. In the end, we obtained about 90,000 one-line changes.
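The exact mining pipeline lives in the repository; the sketch below only illustrates the idea under assumptions. It uses the pydriller library (an assumption, not necessarily what the repository uses) with placeholder paths, and it skips the tokenization and string-literal abstraction (the "str" placeholders visible in the examples below) that the extracted pairs evidently go through.

```python
# Illustrative sketch: collect single-line fixes from one repository's history.
# Assumes the `pydriller` package; the repository path and output file are placeholders.
import json
from pydriller import Repository

def mine_one_line_changes(repo_path: str, out_path: str) -> None:
    pairs = []
    for commit in Repository(repo_path).traverse_commits():
        for mod in commit.modified_files:
            if not (mod.filename or "").endswith(".java"):
                continue
            # diff_parsed: {"added": [(line_no, text)], "deleted": [(line_no, text)]}
            diff = mod.diff_parsed
            # Keep commits that change exactly one line of the file:
            # one deleted (buggy) line replaced by one added (fixed) line.
            if len(diff["deleted"]) == 1 and len(diff["added"]) == 1:
                buggy = diff["deleted"][0][1].strip()
                fixed = diff["added"][0][1].strip()
                if buggy and fixed and buggy != fixed:
                    pairs.append({"buggy": buggy, "fixed": fixed})
    with open(out_path, "w") as f:
        for pair in pairs:
            f.write(json.dumps(pair) + "\n")

mine_one_line_changes("path/to/some-java-repo", "one_line_changes.jsonl")
```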
We treated this problem as a translation task: translating a buggy line into a bug-free line. Unlike natural language, source code is highly structured and coherent, and most buggy lines need only small changes to become bug-free. For these reasons, we expected transformer models to be well suited to the task. For this work, we experimented with the following five configurations (a sketch of the decoding setup follows the table):
Model | Beam Size | BLEU | Fix Accuracy (%) |
---|---|---|---|
Base Model | 4 | 77.9 | 53.3 |
Base Model, Beam Size 10 | 10 | 77.9 | 53.3 |
Base Model, with BPE (2k vocab) | 4 | 82.3 | 54.6 |
Base Model, 1 line context | 4 | 84.3 | 35.5 |
Base Model, 2 line context | 4 | 84.8 | 33.5 |
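This excerpt does not name the training framework, so the following is only a minimal decoding sketch under assumptions: a trained encoder-decoder checkpoint loaded through the Hugging Face transformers API (a stand-in, not necessarily the repository's setup) with a placeholder model path. It shows how a buggy line would be decoded with beam size 4; the BPE configuration would swap in a 2k-subword tokenizer, and the 1- and 2-line-context configurations would presumably prepend the surrounding source lines to the encoder input.

```python
# Illustrative sketch: propose a fix for a buggy line with beam-search decoding.
# The checkpoint path is a placeholder; the Hugging Face API is an assumption.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "path/to/trained-bugfix-model"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

buggy_line = "public static String getDefaultAlias ( String aSourceName )"
inputs = tokenizer(buggy_line, return_tensors="pt")

# Beam search with beam size 4 (10 in the second configuration above);
# keep only the top-scoring hypothesis as the proposed fix.
outputs = model.generate(
    **inputs,
    num_beams=4,
    num_return_sequences=1,
    max_length=128,
    early_stopping=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The examples that follow are input/reference/model-output triples in which the model reproduces the reference fix.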
Input: @Exported ( name = "str" )
Reference: @Exported ( name = "str" , inline = true )
Model Output: @Exported ( name = "str" , inline = true )
Input: String txt = yytext ( ) ;
Reference: }
Model Output: }
Input: public static String getDefaultAlias ( String aSourceName )
Reference: public static String getDefaultAlias ( String sourceName )
Model Output: public static String getDefaultAlias ( String sourceName )
Input: Map <String , DetectorNode >nodeMap = new HashMap <String , DetectorNode >( ) ;
Reference: Map <String , DetectorNode >nodeMap = new HashMap <>( ) ;
Model Output: Map <String , DetectorNode >nodeMap = new HashMap <>( ) ;
Input: ArrayList
Reference: ArrayList
Model Output: ArrayList
Input: Setting . byteSizeSetting ( "str" , new ByteSizeValue ( 32 , ByteSizeUnit . KB ) , Property . NodeScope ) ;
Reference: Setting . byteSizeSetting ( "str" , new ByteSizeValue ( 64 , ByteSizeUnit . KB ) , Property . NodeScope ) ;
Model Output: Setting . byteSizeSetting ( "str" , new ByteSizeValue ( 64 , ByteSizeUnit . KB ) , Property . NodeScope ) ;
The error fixes presented above are by no means the full extent of the model; a full file of translations is available in the repository. These examples do indicate, however, that the model is able to learn syntactic, code-style, and simple logical fixes.
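Since the table above reports BLEU and fix accuracy, here is a sketch of how such a translations file could be scored. It assumes one whitespace-tokenized line per example in each file, that fix accuracy means an exact match with the reference line, and that sacrebleu is an acceptable stand-in for whichever BLEU implementation produced the numbers above; all paths are placeholders.

```python
# Illustrative sketch: score predicted fixes against reference fixes.
# Assumes the `sacrebleu` package; file paths are placeholders.
import sacrebleu

def load_lines(path: str) -> list[str]:
    with open(path) as f:
        return [line.strip() for line in f]

references = load_lines("data/test.fixed")         # one reference fix per line
hypotheses = load_lines("outputs/test.predicted")  # one model prediction per line

bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
fix_accuracy = 100.0 * sum(
    hyp == ref for hyp, ref in zip(hypotheses, references)
) / len(references)

print(f"BLEU: {bleu:.1f}  Fix accuracy: {fix_accuracy:.1f}%")
```

The failure cases below show where the model still falls short.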
Input: private void checkIfClosed ( ) throws IOException {
Reference: private void checkIfClosed ( ) {
Model Output: private void checkIfClosed ( ) throws IOException {
Input: addMessage ( new Message ( MessageKind . INFORMATION , model . getModelUID ( ) + "str" + ( genOK ? "str" : "str" ) ) ) ;
Reference: addMessage ( new Message ( genOK ? MessageKind . INFORMATION : MessageKind . WARNING , model . getModelUID ( ) + "str" + ( genOK ? "str" : "str" ) ) ) ;
Model Output: addMessage ( new Message ( MessageKind . INFORMATION , model . getModelUID ( ) + "str" + ( genOK ? "str" : "str" ) ) ) ;
Input: return 1 ;
Reference: return curSelection == null ? 0 : 1 ;
Model Output: return 2 ;
These three errors are not a comprehensive list of the issues with the models presented, but they highlight some key failure modes: the model sometimes leaves the buggy line unchanged, it struggles with fixes that require restructuring a longer expression, and it cannot recover fixes that depend on context it never sees.
This work shows that Transformer models are well suited to code patching. Their ability to model long-distance dependencies seemed promising in theory, and these experiments show that it holds up in practice. Beyond purely syntactic fixes, these models can repair some code-style and even basic logical errors. We hope that future work will refine these models further to solve increasingly complex bugs.