Build an AI Programmer using Recurrent Neural Network (2)

Recurrent Neural Networks (RNNs) have gained a lot of attention in recent years because they have shown great promise in many natural language processing tasks. Despite their popularity, there are only a limited number of tutorials that explain how to implement a simple and interesting application using state-of-the-art tools. In this series, we will use a recurrent neural network to train an AI programmer, which can write Java code like a real programmer (hopefully). The following will be covered:

1. Building a simple AI programmer
2. Improving the AI programmer - Using tokens (this post)
3. Improving the AI programmer - Using different network structures

In the previous post, we built a basic AI programmer using a simple 1-layer LSTM neural network. The code that the AI programmer generated did not make much sense. In this post, we will train the model on sequences of tokens instead of sequences of individual characters.

1. Getting the raw training data

I'm using the same source code as in the previous post. It is available here: https://github.com/frohoff/jdk8u-jdk. This time, each .java file is scanned, tokenized, and then aggregated into one file called "jdk-tokens.txt". Line breaks are not preserved. You don't need to download the JDK source code. For your convenience, I have included the aggregated file in the GitHub repository of this project. You can find the link at the end of this post.
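
The tokenizer that produced jdk-tokens.txt is not shown in this post. As a rough sketch only, a regular-expression tokenizer along the following lines would turn Java source into a similar space-separated token stream. The pattern and the names TOKEN_PATTERN and tokenize_java are assumptions made for illustration, not the code that built the file, and the sketch ignores comments, character literals, and most multi-character operators.

```python
import re

# Hypothetical pattern: tried left to right, so longer operators come before \S.
TOKEN_PATTERN = re.compile(r'''
    [A-Za-z_$][A-Za-z0-9_$]*        # identifiers and keywords
  | \d+(?:\.\d+)?                   # integer and simple decimal literals
  | "(?:\\.|[^"\\])*"               # string literals
  | \+\+|--|==|!=|<=|>=|&&|\|\|     # a few common two-character operators
  | \S                              # any other single non-space character
''', re.VERBOSE)

def tokenize_java(source):
    """Return the list of tokens found in a string of Java source."""
    return TOKEN_PATTERN.findall(source)

snippet = 'for (int i = 0; i < num; i++) { }'
print(' '.join(tokenize_java(snippet)))
```

Note that a string literal containing spaces would be split again by the later call to split(), so a real pipeline would need to handle that case separately.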

The following code reads the tokens from the jdk-tokens.txt file and slices them to fit the hardware capability of my desktop. In my case, I used only the first 20% of the file, as shown in the code.

path = "./jdk-tokens.txt"
with open(path) as f:
    filetext = f.read().lower()
 
# keep only the first fifth of the text to overcome the memory limitation
# (the cut is made by character, so the last token may be truncated)
slice_end = len(filetext) // 5
filetext = filetext[:slice_end]
 
tokenized = filetext.split()
 
print('# of tokens:', len(tokenized))
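
Slicing the raw string by character count can cut the final token in half. One alternative, sketched here under the assumption that the whole file still fits in memory long enough to be split, is to split first and then keep a fraction of the tokens. The helper name keep_fraction_of_tokens and the sample text are made up for illustration.

```python
def keep_fraction_of_tokens(text, fraction=0.2):
    """Split text into tokens first, then keep the leading fraction of them."""
    tokens = text.lower().split()
    keep = int(len(tokens) * fraction)
    return tokens[:keep]

sample_text = "public static void main ( String [ ] args ) { }"
first_half = keep_fraction_of_tokens(sample_text, fraction=0.5)
print(first_half)
```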

2. Building an index to address the tokens

An LSTM only accepts numeric input. One way of converting tokens to numbers is to assign a unique integer to each token. For example, if there are 1000 unique tokens in the code, we can assign a unique number to each of the 1000 tokens. The code below builds a dictionary with entries like { "public": 0, "static": 1, ... }. The reverse dictionary is also generated for decoding the output of the LSTM.

uniqueTokens = sorted(list(set(tokenized)))
print('total # of unique tokens:', len(uniqueTokens))
token_indices = dict((c, i) for i, c in enumerate(uniqueTokens))
indices_token = dict((i, c) for i, c in enumerate(uniqueTokens))
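
As a small worked example of the two dictionaries, the sketch below encodes a made-up token list to integers and decodes it back. The token list and variable names are illustrative only. Because the unique tokens are sorted, the index of a token depends on its sort order, so "public" does not necessarily get index 0.

```python
tokens = ["public", "static", "void", "main", "(", ")", "public", "void"]

unique_tokens = sorted(set(tokens))
token_to_index = {tok: i for i, tok in enumerate(unique_tokens)}
index_to_token = {i: tok for i, tok in enumerate(unique_tokens)}

# encode the tokens as integers, then decode them again
encoded = [token_to_index[tok] for tok in tokens]
decoded = [index_to_token[i] for i in encoded]

print(unique_tokens)
print(encoded)
print(decoded == tokens)
```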

3. Preparing the training sequences with labels

Here we cut the text into semi-redundant sequences of 10 tokens, starting a new sequence every 3 tokens. Each sequence is a training sample, and its label is the token that immediately follows it.

NUM_INPUT_TOKENS = 10
step = 3
sequences = []
next_token = []
 
for i in range(0, len(tokenized) - NUM_INPUT_TOKENS, step):
    sequences.append(tokenized[i: i + NUM_INPUT_TOKENS])
    next_token.append(tokenized[i + NUM_INPUT_TOKENS])
 
print('nb sequences:', len(sequences))
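
To see what the sliding window produces, here is the same loop applied to a short made-up token list, with a window of 4 instead of 10 so the result is easy to read. The function name make_windows is an assumption for this sketch.

```python
def make_windows(tokens, window, step):
    """Return (sequences, labels): each label is the token right after its sequence."""
    sequences, labels = [], []
    for i in range(0, len(tokens) - window, step):
        sequences.append(tokens[i: i + window])
        labels.append(tokens[i + window])
    return sequences, labels

tokens = "for ( int i = 0 ; i < n ; i ++ ) { }".split()
sequences, labels = make_windows(tokens, window=4, step=3)

for seq, label in zip(sequences, labels):
    print(' '.join(seq), '->', label)
```

Consecutive windows overlap by one token here (window 4, step 3), which is what "semi-redundant" refers to. With a window of 10 and a step of 3, consecutive windows share 7 tokens.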

4. Vectorizing training data

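We first create two matrices and then fill in their values: one for the features and one for the labels. Each token is one-hot encoded, so the position given by its index is set to 1 and all other positions stay 0. len(sequences) is the total number of training samples.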

import numpy as np

# np.bool was removed from recent NumPy releases; the built-in bool works instead
X = np.zeros((len(sequences), NUM_INPUT_TOKENS, len(uniqueTokens)),
             dtype=bool)
y = np.zeros((len(sequences), len(uniqueTokens)), dtype=bool)
for i, sequence in enumerate(sequences):
    for t, token in enumerate(sequence):
        X[i, t, token_indices[token]] = 1
    y[i, token_indices[next_token[i]]] = 1
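
One-hot encoding is the reason memory runs out quickly: the size of X grows with the number of sequences times the window length times the vocabulary size. The following back-of-the-envelope sketch estimates the memory for both matrices, assuming one byte per boolean entry. The token count and vocabulary size used here are invented round numbers, not measurements of jdk-tokens.txt, and one_hot_bytes is a hypothetical helper.

```python
def one_hot_bytes(num_tokens, vocab_size, window=10, step=3):
    """Return (number of sequences, bytes for X plus y) at 1 byte per entry."""
    num_sequences = len(range(0, num_tokens - window, step))
    x_bytes = num_sequences * window * vocab_size
    y_bytes = num_sequences * vocab_size
    return num_sequences, x_bytes + y_bytes

# illustrative numbers only
num_sequences, total_bytes = one_hot_bytes(num_tokens=1_000_000, vocab_size=5_000)
print(num_sequences, "sequences,", round(total_bytes / 1e9, 1), "GB")
```

With these made-up numbers the two matrices already need roughly 18 GB, which shows why only a slice of the data is used.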

5. Constructing a single layer LSTM model

We will build the network described below.

It is also straightforward to stack two LSTM layers. In the code below, the 2-layer version is active and the 1-layer version is left as a commented-out line.

The code below defines the structure of the neural network. The network contains an LSTM layer with 128 hidden units. The input_shape parameter specifies the input sequence length and the dimension of the input at each time step. Dense() implements output = activation(dot(input, kernel) + bias). The input here is the output of the LSTM layer. The activation function is specified by the line Activation('softmax'). The optimizer is the optimization algorithm, which here is RMSprop. You may be familiar with a commonly used one from logistic regression, which is stochastic gradient descent. The model.compile() call specifies the cost function. In this case, we use 'categorical_crossentropy'. You may check out this nice post to understand why cross-entropy is better than mean squared error (MSE).

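# assumes the standalone keras package; with TensorFlow, import from tensorflow.keras
from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation
from keras.optimizers import RMSprop

model = Sequential()
 
# 1-layer LSTM
#model.add(LSTM(128, input_shape=(NUM_INPUT_TOKENS, len(uniqueTokens))))
 
# 2-layer LSTM
model.add(LSTM(128, return_sequences=True,
               input_shape=(NUM_INPUT_TOKENS, len(uniqueTokens))))
model.add(LSTM(128))
 
model.add(Dense(len(uniqueTokens)))
model.add(Activation('softmax'))
 
# newer Keras releases name this argument learning_rate; older ones used lr
optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
model.summary()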
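
model.summary() reports the number of trainable parameters per layer. Those numbers can be checked by hand, assuming the standard LSTM layout that Keras uses by default (four gate blocks, each with input weights, recurrent weights, and a bias). The vocabulary size of 1000 is borrowed from the example in section 2 and is not the real vocabulary size of the data. The helper names are mine.

```python
def lstm_params(input_dim, units):
    # four gate blocks, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # one weight per input/output pair plus one bias per output
    return input_dim * units + units

vocab_size = 1000   # illustrative value, not measured from the data
hidden = 128

layer_counts = {
    "lstm_1": lstm_params(vocab_size, hidden),
    "lstm_2": lstm_params(hidden, hidden),
    "dense": dense_params(hidden, vocab_size),
}
for name, count in layer_counts.items():
    print(name, count)
print("total", sum(layer_counts.values()))
```

The first LSTM layer and the Dense layer both grow linearly with the vocabulary size, so a larger vocabulary makes the model bigger as well as the training matrices.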

The model definition above also stacks a second LSTM layer, which makes it a 2-layer LSTM RNN.
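
To make the output layer concrete, here is a minimal NumPy sketch of what Dense followed by softmax computes, and of the categorical cross-entropy for a single sample. The sizes and the random values are arbitrary, and the vector lstm_output merely stands in for the real LSTM output. It is not how Keras implements these layers internally.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_units, vocab_size = 4, 3          # tiny sizes for illustration

lstm_output = rng.normal(size=hidden_units)           # stand-in for the LSTM output
kernel = rng.normal(size=(hidden_units, vocab_size))  # Dense weights
bias = np.zeros(vocab_size)                           # Dense bias

# Dense: dot(input, kernel) + bias
logits = np.dot(lstm_output, kernel) + bias

def softmax(z):
    z = z - np.max(z)            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Activation('softmax'): turn the scores into a probability per token
probs = softmax(logits)

target = np.array([0.0, 1.0, 0.0])        # one-hot label: the next token has index 1
loss = -np.sum(target * np.log(probs))    # categorical cross-entropy for one sample
print(probs, loss)
```

Because the label is one-hot, the loss reduces to the negative log of the probability assigned to the correct next token.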

6. Training the model and generating Java code

The sample function is used to sample an index from a probability array. For example, given preds=[0.5,0.2,0.3] and the default temperature, the function returns index 0 with probability 0.5, 1 with probability 0.2, or 2 with probability 0.3. It is used to avoid generating the same sequence over and over again. We want to see the different code sequences the AI programmer can write.

def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
 
import random
import sys

# train the model, output generated code after each iteration
for iteration in range(1, 60):
    print()
    print('-' * 50)
    print('Iteration', iteration)
    model.fit(X, y, batch_size=128, epochs=1)
 
    start_index = random.randint(0, len(tokenized) - NUM_INPUT_TOKENS - 1)
 
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print()
        print('----- diversity:', diversity)
 
        sequence = tokenized[start_index: start_index + NUM_INPUT_TOKENS]
 
        generated = list(sequence)
 
        print('----- Generating with seed: "' + ' '.join(sequence) + '"-------')
        # trailing space keeps the first generated token separate from the seed
        sys.stdout.write(' '.join(generated) + ' ')
 
        for i in range(100):
            x = np.zeros((1, NUM_INPUT_TOKENS, len(uniqueTokens)))
            for t, char in enumerate(sequence):
                x[0, t, token_indices[char]] = 1.
 
            preds = model.predict(x, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_pred_token = indices_token[next_index]
 
            generated.append(next_pred_token)
            sequence = sequence[1:]
            sequence.append(next_pred_token)
 
            sys.stdout.write(next_pred_token+" ")
            sys.stdout.flush()
        print()
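
To see what the temperature (called diversity in the loop above) does, the sketch below applies the same rescaling that sample() performs before it draws an index, using the example preds=[0.5,0.2,0.3]. It uses plain Python instead of NumPy, and the name apply_temperature is made up for this illustration.

```python
import math

def apply_temperature(preds, temperature):
    """Rescale a probability list the same way sample() does before drawing."""
    scaled = [math.exp(math.log(p) / temperature) for p in preds]
    total = sum(scaled)
    return [s / total for s in scaled]

preds = [0.5, 0.2, 0.3]
for temperature in [0.2, 0.5, 1.0, 1.2]:
    rescaled = apply_temperature(preds, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

A temperature below 1 sharpens the distribution toward the most likely token, a temperature of 1 leaves it unchanged, and a temperature above 1 flattens it so that less likely tokens are picked more often.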

7. Results

It takes a few hours to train the model. I stopped at the 40th iteration and the generated code looks like the following:

----- Generating with seed: "true ) ; } else { boolean result = definesequals"-------
true ) ; } else { boolean result = definesequals
( ) . substring ( 1 , gradients . get ( p ) ; } 
if ( val . null || ( npoints == null ) ? new void . bitlength ( ) + prefixlength ) ; 
for ( int i = 0 ; i < num ; i ++ ) } break ; } 
if ( radix result = != other . off ) ; 
int endoff = b . append ( buf , 0 , len + 1 ) ; digits ++ ] ; 

The generated code looks much better than the code generated by the previous character-based approach. Note that I added the line breaks for readability. We can see that the LSTM captures loops and conditions pretty well, and the code starts to make more sense. For example, "for ( int i = 0 ; i < num ; i ++ )" is a perfect Java for loop. If you tune the parameters (such as NUM_INPUT_TOKENS and step) and train longer, you may get better results. Feel free to try. Again, I already know a better way to do this job, so I will stop here and make the improvement in the next post.

You may also take a look at the code generated in the earlier iterations. It makes less sense.

8. What's Next?

In this post, I used token sequences as input to train the model, and the model predicts token sequences. If everything works correctly, it should work better than the character-based approach. In addition, we can also use different network structures. We will explore those in the next post.

Source Code

1) The source code of this post is lstm_ai_coder_tokens.py which is located at https://github.com/ryanlr/RNN-AI-Programmer
