With the rise of deep learning in recent years, many code clone detection (CCD) methods have adopted deep learning techniques and achieved promising results; the same holds for cross-language CCD. However, deep learning techniques require a dataset to train the models. Because collecting data for cross-language CCD is difficult, the available datasets are typically small and differ from real-world clones. This creates a data bottleneck problem: limits in data scale and quality prevent even a well-designed model from reaching its full potential. To mitigate this, we propose a tree autoencoder (TAE) architecture. It uses unsupervised learning to pretrain on the abstract syntax trees (ASTs) of a large-scale dataset, and then fine-tunes the trained encoder on the downstream CCD task. Our proposed TAE consists of a tree Long Short-Term Memory (LSTM) encoder and a tree LSTM decoder. We design a novel embedding method for AST nodes, comprising type embedding and value embedding. For training the TAE, we present an “encode and decode by layers” strategy and a node-level batch size design. For the CCD dataset, we propose a negative sampling method based on a probability distribution. The experimental results on two datasets verify the effectiveness of our embedding method and show that the TAE and its pretraining enhance the performance of the CCD model. Node context information is well captured, and the reconstruction accuracy of node values reaches 95.45%. TAE pretraining improves CCD performance by 4% in F1 score, which alleviates the data bottleneck problem.

In software development, to improve efficiency and reduce time cost, existing code is often copied, pasted, or reused, which produces code clones. A code clone is two or more identical or similar source code fragments. Code cloning is quite common: the reported percentage of cloned code ranges from 5% to 70% [

Many CCD methods have been proposed, including traditional methods [

To mitigate this, we use unsupervised learning to pretrain the model on a large-scale dataset and then fine-tune the trained model on the small dataset. We adopt an autoencoder framework: pretraining is performed as an encode-and-decode task on ASTs. It requires no labels, and as much data as needed can be fetched from the internet. By training an autoencoder with unsupervised learning on a large-scale dataset, the pretrained encoder achieves better results on the downstream CCD task, especially when the amount of labeled data is small. We make the following contributions:

We design a novel embedding method for AST nodes, comprising type embedding and value embedding. We design a tree-based distance and a tree-based GloVe algorithm for type embedding, and use a Gated Recurrent Unit (GRU) [

We propose a novel architecture that uses an autoencoder to pretrain on a large-scale dataset and then fine-tunes the encoder for CCD. This reduces the dataset requirement. We design a tree-structured encoder and decoder, which together form a tree autoencoder (TAE).

We present three training techniques for TAE: “encode and decode by layers”, a node-level batch size, and a tree splitting strategy. We also propose a probability-distribution-based negative sampling method for training on the CCD dataset.

We verify our proposed methods on a self-collected dataset and an open-source cross-language CCD dataset. Experimental results verify the effectiveness of the node embedding method and show that the TAE and its pretraining enhance the performance of the CCD model.

In the next section, we review related work. Section 3 presents our method, Section 4 presents the experimental setup and results, and Section 5 concludes our work.

In this section, we review related work on CCD and other approaches to vector embedding representation.

Recently, the rise of neural networks has driven progress in CCD. Many researchers have designed neural network models for CCD and achieved promising results. AST-based neural network (ASTNN) [

Clone detection with learning to hash (CDLH) [

For cross-language CCD, C2D2 [

Perez et al. [

CroLSim [

Vector representation refers to projecting objects (e.g., words, nodes, tokens) from a high-dimensional space into a continuous vector space of much lower dimension. It maps each object to a fixed-dimension vector [

There are word embedding methods, including Word2vec [

Besides the word embedding, there are other vector representation techniques, some of which are designed for source code. Peng et al. [

The work of [

The overview of our method is illustrated in

Our method operates on the AST of the source code. We first preprocess the source code: extract the AST and apply a transformation (i.e., refer to the AST syntax of the target language and simplify the AST). Inspired by the node representation format in [
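As a concrete illustration, Python's built-in `ast` module can extract an AST and flatten it into (type, value) node pairs. The exact simplification and node format used in our preprocessing may differ, so this is only a sketch:

```python
import ast

def extract_ast(source: str):
    """Parse Python source and flatten its AST into (type, value) pairs.

    The value is the identifier/literal a node carries, or None for purely
    structural nodes -- a simplification of the real preprocessing.
    """
    nodes = []
    for node in ast.walk(ast.parse(source)):
        node_type = type(node).__name__
        if isinstance(node, ast.Name):
            value = node.id
        elif isinstance(node, ast.Constant):
            value = repr(node.value)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            value = node.name
        else:
            value = None  # structural node without a value
        nodes.append((node_type, value))
    return nodes

pairs = extract_ast("def add(a, b):\n    return a + b")
```

Each pair then feeds the type and value embedding stages described below.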

Our type embedding method is based on GloVe [_{i} and _{j} are the left and right embeddings of the

If

If

The tree-based distance can be interpreted visually as connecting every two children of the same node by a shortcut. It is easy to prove that the proposed distance satisfies the three conditions of a valid distance metric (non-negativity, symmetry, and the triangle inequality).
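The shortcut interpretation can be made concrete as a shortest-path computation: besides parent-child edges, every two children of the same node are joined by a unit-weight edge. A minimal BFS sketch (the `parent` map and node names are illustrative):

```python
from collections import deque

def tree_distance(parent, u, v):
    """Shortest-path distance on a tree augmented with 'shortcuts':
    besides parent-child edges, any two children of the same node are
    joined by a unit-weight edge. `parent` maps node -> parent
    (root maps to None)."""
    children = {}
    adj = {n: set() for n in parent}
    for c, p in parent.items():
        if p is not None:
            children.setdefault(p, []).append(c)
            adj[c].add(p)
            adj[p].add(c)
    for sibs in children.values():          # sibling shortcuts
        for a in sibs:
            for b in sibs:
                if a != b:
                    adj[a].add(b)
    dist = {u: 0}                           # plain BFS from u
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return dist[x]
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return None
```

With `parent = {"r": None, "a": "r", "b": "r", "c": "a"}`, siblings `a` and `b` are at distance 1 through the shortcut instead of 2 through their parent; symmetry and the triangle inequality follow because this is a shortest-path metric on an undirected graph.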

We define a node’s sequentiality by whether the order of its children matters: if the order of a node’s children matters and the number of children is not fixed, we call it a sequential node (it has sequentiality); otherwise, we call it a non-sequential node (it has non-sequentiality). We add five auxiliary losses corresponding to five additional tasks. One is to predict a type’s sequentiality; the other four are to predict special relationships: the parent-child relationship, the grandparent-grandchild relationship, the near-sibling relationship (sibling nodes whose indexes differ by 1), and the near-near-sibling relationship (sibling nodes whose indexes differ by 2). Their losses are denoted L_{s}, L_{a}, L_{b}, L_{c}, and L_{d}, requiring vector _{g}. The final loss is as follows:

_{i} is the statistic about type _{i} and can be positive or negative (depending on sequentiality). _{ij} is the statistic about _{i} and _{j} (_{i} is on the left of _{j}; “left” and “right” follow the node order given by a DFS over the AST: earlier is left, later is right) and can be positive or negative (depending on the relationship between _{i} and _{j}). L_{b}, L_{c}, and L_{d} are similar to L_{a}:

where _{k} is the co-occurrence matrix of distance _{k}, which grows extremely fast, and _{d} is the normalization ratio used to prevent information flattening. Evidence can be found in [

The whole procedure of tree-based GloVe is as follows: slide the window over the ASTs to obtain all co-occurrence matrices and optimize

In the value embedding phase, we use a GRU + LSTM autoencoder framework. The GRU encodes a value into a fixed-length vector, and the LSTM decodes the vector to reconstruct the original value. In the encoding phase, two special tokens, marking the start and the end of the word, are added before and after the word to form the input sequence. The sequence is then fed into the encoder to obtain the output vector. In the decoding phase, the output vector is distributed across time steps and fed into the LSTM, and a linear layer is applied to the outputs to obtain the predicted sequence. The illustration can be seen in
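A minimal PyTorch sketch of this GRU-encoder/LSTM-decoder shape follows; the vocabulary size (128 ASCII characters plus two special tokens), dimensions, and token ids are illustrative assumptions, not the exact configuration:

```python
import torch
import torch.nn as nn

class ValueAutoencoder(nn.Module):
    """GRU encoder + LSTM decoder over character sequences, sketching the
    value autoencoder shape. Vocabulary size, dimensions, and token ids
    are illustrative assumptions."""

    def __init__(self, vocab_size=130, char_dim=8, value_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, char_dim)
        self.encoder = nn.GRU(char_dim, value_dim, batch_first=True)
        self.decoder = nn.LSTM(value_dim, value_dim, batch_first=True)
        self.out = nn.Linear(value_dim, vocab_size)

    def forward(self, seq):                    # seq: (batch, time)
        x = self.embed(seq)
        _, h = self.encoder(x)                 # h: (1, batch, value_dim)
        code = h[-1]                           # fixed-length value vector
        # Distribute the code over every time step for the decoder.
        dec_in = code.unsqueeze(1).repeat(1, seq.size(1), 1)
        y, _ = self.decoder(dec_in)
        return self.out(y)                     # (batch, time, vocab)

model = ValueAutoencoder()
tokens = torch.tensor([[128, 102, 111, 111, 129]])  # start, 'f','o','o', end
logits = model(tokens)
```

Training would minimize cross entropy between `logits` and the input sequence, so the fixed-length code must carry enough information to reconstruct the value.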

Word classification: predict whether the word/value is an identifier, a real number, or something else.

Character classification: predict each character’s class. We categorize the 128 ASCII characters, together with the start-of-word and end-of-word tokens, into 7 classes: the start-of-word token, the end-of-word token, digits, capital letters, lowercase letters, non-printing characters (control characters, ASCII range 0 ∼ 31), and other characters.
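A direct way to express this 7-way mapping (with `"SOW"`/`"EOW"` standing in for the start/end-of-word tokens, whose exact symbols are assumptions):

```python
def char_class(ch):
    """Map an input symbol to one of the 7 classes. "SOW"/"EOW" stand in
    for the start/end-of-word tokens (their exact symbols are assumed)."""
    if ch == "SOW":
        return "start-of-word"
    if ch == "EOW":
        return "end-of-word"
    if "0" <= ch <= "9":
        return "digit"
    if "A" <= ch <= "Z":
        return "capital"
    if "a" <= ch <= "z":
        return "lowercase"
    if ord(ch) <= 31:
        return "non-printing"   # ASCII control characters
    return "other"
```
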

Finally, the whole loss is the following: _{n+1} is the end-of-word token, _{w} and _{i} are the classification labels, and CE is the cross entropy.

Our TAE model includes an encoder and a decoder, both of which use a tree LSTM. LSTM [_{s} is the input of node _{j}: the node embedding, i.e., the concatenation of its type embedding and value embedding. Our tree LSTM is shown in
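For reference, a standard child-sum Tree-LSTM cell (in the style of Tai et al.) can be sketched as below; our encoder cell may differ in detail, so treat this as an illustration of the general mechanism rather than our exact formulation:

```python
import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    """Child-sum Tree-LSTM cell in the style of Tai et al.: each node's
    state is computed from its input vector and its children's states."""

    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.iou = nn.Linear(in_dim + hid_dim, 3 * hid_dim)
        self.f_x = nn.Linear(in_dim, hid_dim)
        self.f_h = nn.Linear(hid_dim, hid_dim)

    def forward(self, x, child_h, child_c):
        # x: (in_dim,); child_h, child_c: (num_children, hid_dim)
        h_sum = child_h.sum(dim=0)
        i, o, u = self.iou(torch.cat([x, h_sum])).chunk(3)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.f_x(x) + self.f_h(child_h))  # one gate per child
        c = i * u + (f * child_c).sum(dim=0)
        h = o * torch.tanh(c)
        return h, c

cell = ChildSumTreeLSTMCell(in_dim=4, hid_dim=8)
# A leaf has no children: pass empty child states.
h_leaf, c_leaf = cell(torch.randn(4), torch.zeros(0, 8), torch.zeros(0, 8))
```

Encoding proceeds bottom-up: leaves get empty child states, and each internal node consumes its children's `(h, c)` pairs.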

The decoder of TAE contains a node embedding decoder (type classifier + value decoder) and a Tree LSTM Decoder (inner decoder + outer decoder + end of children (EOC) classifier):

where _{s} into _{sj} is a weight defined by _{s} to decide how much information flows from node _{j} can be obtained according to _{j} and _{j}. The decoder is depicted in _{s} is the children count of node _{root} is not used in _{children} is |

The encoding/decoding of each node depends on the encoding/decoding of its children/parent, so encoding must proceed bottom-up and decoding top-down. We therefore introduce an “encode and decode by layers” strategy: group nodes by layer and encode/decode multiple nodes of the same layer simultaneously, in a layered fashion [
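The layering itself is simple to compute: assign each node its depth and bucket nodes by depth. Encoding then visits layers from deepest to shallowest, decoding the reverse. A sketch (the `parent` map is illustrative):

```python
def group_by_layers(parent):
    """Bucket tree nodes by depth so that each layer can be processed as
    one batch: encoding visits layers deepest-first, decoding the
    reverse. `parent` maps node -> parent (root maps to None)."""
    depth = {}

    def d(n):
        if n not in depth:
            p = parent[n]
            depth[n] = 0 if p is None else d(p) + 1
        return depth[n]

    layers = {}
    for n in parent:
        layers.setdefault(d(n), []).append(n)
    return [layers[k] for k in sorted(layers)]  # shallow -> deep

layers = group_by_layers({"r": None, "a": "r", "b": "r", "c": "a"})
```

Encoding iterates `reversed(layers)` so every node's children are already encoded; decoding iterates `layers` top-down so every node's parent is already decoded.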

Our CCD model uses a siamese-based architecture neural network [

To create code pairs, we first sample a code fragment (the anchor), and then sample a positive sample and

Such a candidate set ensures that its size is always 2ε|
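The sampling idea can be sketched as drawing negatives from clone groups other than the anchor's, with probabilities proportional to a weight function; the weight function below is a placeholder for the actual probability distribution:

```python
import random

def sample_negatives(anchor_group, groups, weight, k, rng=random):
    """Draw k negatives for an anchor from clone groups other than its
    own, with probability proportional to weight(fragment). The weight
    function is a placeholder for the actual probability distribution."""
    pool = [f for g, frags in groups.items() if g != anchor_group
            for f in frags]
    return rng.choices(pool, weights=[weight(f) for f in pool], k=k)

# Illustrative clone groups: fragments in the same group are clones.
groups = {"g1": ["a1", "a2"], "g2": ["b1"], "g3": ["c1", "c2"]}
negatives = sample_negatives("g1", groups, weight=lambda f: 1.0, k=5)
```

Excluding the anchor's own group guarantees every drawn pair is a true negative; the weight function is where a non-uniform distribution would plug in.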

In our experiments, we use two datasets: a pretrain dataset for node embedding and TAE pretrain and a CCD dataset for cross-language CCD [

| Language | Node (filtration range) | Width (filtration range) | Depth (filtration range) | Seq types | Non-seq types | Total types |
|---|---|---|---|---|---|---|
| Python | [6, 25000] | [4, 20000] | [3, 200] | 37 | 124 | 161 |
| Java | [8, 30000] | [6, 25000] | [4, 200] | 46 | 177 | 223 |

| Language | File count | Sum node count | Max node count | Max width | Max depth |
|---|---|---|---|---|---|
| Python (before filtration) | 153,870 | 171,567,690 | 1,046,594 | 521,225 | 1,009 |
| Python (after filtration) | 141,473 | 152,991,959 | 24,952 | 16,916 | 182 |
| Java (before filtration) | 422,182 | 517,881,807 | 1,555,420 | 982,463 | 1,246 |
| Java (after filtration) | 421,435 | 477,452,394 | 29,982 | 22,852 | 200 |

We conduct experiments on a six-core Windows 10 machine with 16 GB of memory and an Nvidia GeForce GTX 1650 GPU with 4 GB of memory. PyTorch (

Node embedding includes type embedding and value embedding. The experimental settings are in

| Experiment | Parameter | Value | Parameter | Value | Parameter | Value |
|---|---|---|---|---|---|---|
| Type embedding | Embedding dim d_t | 32 | Window size | 5 | Batch size | 2048 |
| | Exponent | 0.3 | Momentum | 0.997 | Epochs | 5000 |
| Value embedding | | 0 or 0.5 | | 0.1α | Epochs | 100 |
| | Decoder hidden d_o | 2 d_v | Batch size | 512 | | |

For the value autoencoder training, we first filter out words whose length exceeds a predefined threshold (here, 100); the words no longer than the threshold form the base dataset. The base dataset is then split into a training set and a testing set at a ratio of 9:1. The fine-tuning learning rate of the char embedding layer is 0.0001. Four criteria are used to evaluate the models:

char_acc: character-level accuracy.

s_char_acc: character-level accuracy in strict mode.

word_acc: word-level accuracy.

len_acc: word-length accuracy, the ratio of words whose length is predicted correctly.
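The length filtering and 9:1 split described above amount to:

```python
import random

def filter_and_split(words, max_len=100, train_ratio=0.9, seed=0):
    """Keep words no longer than max_len, then split the base dataset
    9:1 into training and testing sets (seed only for reproducibility)."""
    base = [w for w in words if len(w) <= max_len]
    rng = random.Random(seed)
    rng.shuffle(base)
    cut = int(len(base) * train_ratio)
    return base[:cut], base[cut:]

words = [str(i) for i in range(100)] + ["x" * 101]  # one over-long word
train, test = filter_and_split(words)
```
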

The experimental result is in

| d_c | d_v | | char_acc | s_char_acc | word_acc | len_acc |
|---|---|---|---|---|---|---|
| 8 | 64 | 0 | 90.87% | 84.07% | 74.89% | 99.80% |
| 8 | 64 | 0.5 | 90.84% | 83.34% | 74.07% | 99.78% |
| 8 | 128 | 0 | 95.45% | 91.68% | 86.22% | 99.64% |
| 8 | 128 | 0.5 | 95.03% | 90.86% | 85.09% | 99.47% |
| 16 | 128 | 0 | 92.84% | 87.31% | 79.34% | 96.43% |
| 16 | 128 | 0.5 | 91.87% | 86.13% | 76.94% | 97.53% |

The training of TAE uses the whole ASTs produced by data preprocessing. During training, we process the source code on the fly via multi-processing. We filter the base AST data by maximum node count; the remaining AST data is also split into a training set and a testing set at a ratio of 9:1. To speed up the value embedding phase, we precompute the top 10,000 most frequent values to build a lookup table. We use scheduled sampling [
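Scheduled sampling gradually replaces ground-truth decoder inputs with the model's own predictions during training. One common schedule from the scheduled-sampling paper is inverse-sigmoid decay; the constant below is illustrative, and the exact schedule used may differ:

```python
import math

def teacher_forcing_prob(step, k=2000.0):
    """Inverse-sigmoid decay: the probability of feeding the ground-truth
    token (rather than the model's own prediction) shrinks as training
    proceeds. k controls the decay speed and is an illustrative value."""
    return k / (k + math.exp(step / k))
```

At each decoding step, the ground-truth token is used with this probability and the model's previous prediction otherwise, which narrows the gap between training and inference.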

The ratio we used in

| d_v | d_e | Inner | | L_value | L_children | type_acc | eoc_acc | eoc_precision | eoc_recall |
|---|---|---|---|---|---|---|---|---|---|
| 64 | 512 | LSTM | 0.0283 | 0.0312 | 0.00216 | 97.56% | 98.77% | 98.96% | 97.56% |
| 64 | 512 | LSTM | 0.00468 | 0.0152 | 0.00219 | 98.71% | 98.73% | 98.96% | 98.66% |
| 128 | 1024 | LSTM | 0.0237 | 0.00436 | 0.00264 | 96.25% | 97.81% | 98.73% | 96.86% |
| 128 | 1024 | GRU | 0.0156 | 0.00403 | 0.00143 | 98.58% | 97.64% | 96.92% | 98.44% |

We first conduct a cross-language code clone classification experiment. The CCD dataset is randomly split into a training set and a testing set at a ratio of 9:1. For negative sampling, we use

| Model | Precision | Recall | F1 |
|---|---|---|---|
| LSTM (pretrained token vectors) [ | 55% | 83% | 66% |
| Ours (randomly initialized encoder) | 64% | 88% | 74% |
| Ours (pretrained encoder) | 67% | 90% | 77% |

| Model | Precision | Recall | F1 |
|---|---|---|---|
| LSTM (pretrained token vectors) [ | 19% | 90% | 32% |
| Ours (randomly initialized encoder) | 28% | 92% | 43% |
| Ours (pretrained encoder) | 31% | 93% | 47% |

Our TAE training operates at the file level; a more fine-grained level (e.g., class/function level) can be considered in future work. With the trained TAE, code fragments can be represented as vectors, but different languages correspond to different vector spaces. We could employ an unsupervised learning framework such as CycleGAN [

In this work, we focus on cross-language CCD, using an autoencoder to pretrain on a large-scale dataset. We first introduce the node embedding method, comprising type embedding and value embedding. We then detail our TAE model, including the encoder and the decoder, and present techniques for training the TAE, including the “encode and decode by layers” strategy and the batch size design. Next, we describe the CCD model and our negative sampling strategy. Finally, we evaluate our method on a self-collected dataset and an open-source cross-language CCD dataset. The experimental results verify the effectiveness of our node embedding method and show that the TAE and its pretraining enhance the performance of the CCD model. Node context information is well captured, and the reconstruction accuracy of node values reaches 95.45%. TAE pretraining improves CCD performance by 4% in F1 score, which alleviates the data bottleneck problem.