Solving the puzzle cube with deep reinforcement learning and tree search

This is a project to solve the 3x3x3 puzzle cube (a.k.a. the Rubik’s cube) using reinforcement learning. While the puzzle cube is definitely a "solved problem" (there are many algorithms that can be used to solve a scrambled cube), the goal here is to have a computer learn from scratch how to do it. In reinforcement learning, a computer tries to solve a scrambled cube on its own with little or no human knowledge and then, through trial and error, slowly learns the best way to do it.

More technically, I use a combination of a residual/convolutional neural network (NN) and Monte Carlo tree search (MCTS). This is (almost) the same technique employed by DeepMind’s AlphaGo Zero and AlphaZero to play board games like Go and Chess at superhuman levels.

The puzzle cube is a fun choice for reinforcement learning because it is non-trivial for a human to learn from scratch. (For example, I don’t even know how to solve one, even though I have spent a number of years trying on my own.) Yet for those who do know an algorithm, solving the cube is second nature, seemingly involving just pattern recognition and intuition rather than deep mathematical reasoning. This suggests that the right reinforcement learning algorithm could learn to solve the cube, especially one using neural networks, which are good at pattern recognition.

Update 1, April 15, 2019: In the year since this project was completed, there has been a breakthrough in using neural networks to solve the puzzle cube. McAleer, Agostinelli, Shmakov, and Baldi (preprint, pre-conference version) independently used reinforcement learning to solve the puzzle cube. There are many similarities between their work and mine, including the use of Monte Carlo tree search. Despite their progress, there is still a lot to be done on solving the puzzle cube with reinforcement learning: while McAleer et al.'s trained algorithm solves any puzzle cube, it often takes hours to do so, whereas hand-programmed solvers finish instantly. See this expository article for an (unsuccessful) attempt to reproduce McAleer et al.'s results.

Update 2, April 15, 2019: I made this code base much easier to use, and I've supplied pretrained weights so you don't have to train your own network. See Getting Started With The Code.

Results

(Figure: performance of the policy alone versus the policy combined with MCTS.)

This plot shows the performance of the neural network alone (that is, the policy output of the neural network) and of the network combined with MCTS (which boosts the network's performance but takes longer to run). (The MCTS curve has error bars since I didn't generate enough samples for a smooth plot.)

I also compare my work to the state of the art. The two puzzle-solving neural networks mentioned below are the best I could find. (For more puzzle cube projects, see the complete list at the bottom of this page.)

Both Alex Irpan's and UnixPickle's algorithms were trained with supervised learning, using known algorithms for solving the puzzle cube, whereas my approach used reinforcement learning: it learned to solve the puzzle cube all by itself!

I still hold out hope that a different reinforcement learning algorithm can be used to solve the puzzle cube completely. (See the Further Ideas section below for some next steps.)

Getting started with the code

Cloning the repo

$ git clone --recursive https://github.com/jasonrute/puzzle_cube

(The --recursive ensures you clone the code submodule as well.)

Using the pretrained network

$ cd puzzle_cube/code/
$ python3 example.py

To run this you will need TensorFlow, as well as a number of other packages, all of which can be installed using pip. The example.py script is designed to be easy to follow and adapt. This code runs fine on my MacBook Air.

Training a network from scratch

$ cd puzzle_cube/code/
$ python3 train.py

The output is a bit cryptic, but as long as it is running, it is training new networks, which are stored in the results directory. Training is too slow and memory intensive to run on my MacBook Air. Instead I trained it for 34 generations (which took ~2 days) on an AWS g3.4xlarge spot instance.

To dig in more, look at code/config.py for various settings that can be changed; for example, lower games_per_generation (a "game" is one attempt to solve a randomly shuffled cube).
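
For instance, a minimal tweak might look like this (illustrative values only; check the file for the actual defaults):

```python
# code/config.py (illustrative values, not the shipped defaults)
games_per_generation = 100   # fewer games per generation -> faster but noisier training
prev_versions = []           # earlier result directories to resume from (see below)
```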

Other details

(If you are forking this project, you may have to fork both this repository and jasonrute/puzzle_cube_code. Then you may have to reassociate the URL for the code submodule. I am not sure. See here.)

The code directory is a separate repository. After making any changes to the code, commit the changes to the code repository. Then the results will be saved in a directory under results. (The directory is named using the git describe command, which is run from inside the code repository.) To use a trained model from a previous run, put the name of that directory in the prev_versions list in the config.py file, e.g. prev_versions = ['v1.0.2-r2', 'v1.0.2-r']

Some of the big ideas

A neural network is basically a black box that takes in an input and gives an output, and it can be trained with labeled data. Our neural network has the following input and output. (See Neural network architecture under Technical points for the full architecture.)
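
As a rough sketch of the interface (the shapes here are my assumptions; the exact encoding lives in the code):

```python
import numpy as np

# Input: the current cube state, e.g. the 54 sticker colors
# one-hot encoded over the 6 colors (assumed encoding).
state = np.zeros((54, 6), dtype=np.float32)

# Output 1 (policy): a probability for each possible move,
# e.g. the 12 quarter turns (2 directions for each of the 6 faces).
policy = np.full(12, 1 / 12, dtype=np.float32)  # an untrained net is near uniform

# Output 2 (value): one number estimating how promising this state is.
value = 0.0
```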

At the beginning the neural network is just random (and is later trained).

The model is fed into a search algorithm based on Monte Carlo tree search (MCTS). The network uses its policy output to explore the tree of possible paths to a solved cube. Since the number of paths grows exponentially, the tree search has to be judicious in choosing which branches to explore. At each node in our tree search we continuously transition from exploration to exploitation, as sketched below.
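
Concretely, the selection rule is a PUCT-style formula as in AlphaGo Zero: pick the action maximizing an exploitation term Q plus an exploration bonus U that is large for actions the policy favors but shrinks as those actions are visited. A minimal sketch (the node bookkeeping and c_puct are assumptions, not the project's exact code):

```python
import math

def select_action(node, c_puct=1.0):
    """PUCT selection: exploit actions with high average value (Q) while
    exploring actions the policy likes but that have few visits (U)."""
    total_visits = sum(node.visit_count[a] for a in node.actions)

    def score(a):
        q = node.mean_value[a]                      # exploitation term
        u = (c_puct * node.prior[a]                 # exploration bonus, which
             * math.sqrt(total_visits)              # shrinks as the action
             / (1 + node.visit_count[a]))           # accumulates visits
        return q + u

    return max(node.actions, key=score)
```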

We generate data by shuffling a solved cube for n steps and then trying to solve it with our neural network and MCTS. (Here n is chosen so that we solve the cube about 50% of the time. If we picked a completely random cube, the MCTS would almost certainly fail, especially at the early stages of training.) Each attempt is then turned into training data, roughly as sketched below.
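
A sketch of that loop (the function names here are placeholders for illustration, not the project's actual API):

```python
def generate_episode(n, solved_cube, random_move, solve_with_mcts):
    """Scramble a solved cube with n random moves, then attempt to solve it;
    the search tree's statistics become training targets for the network."""
    cube = solved_cube()
    for _ in range(n):
        cube = random_move(cube)
    return solve_with_mcts(cube)

# n is tuned so the current network solves roughly half the scrambles,
# which keeps the training signal informative as the network improves.
```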

Technical points

For most technical details, one can read the well-written AlphaGo Zero paper. However, my training algorithm differs on a few key points, which I outline here. For more details, I recommend playing around with the program and exploring the code.

1. MCTS and PUCT

In AlphaGo Zero the neural network value represents a signed probability of winning (where 1 is winning and -1 is losing). However, I use the neural network value as both a measure of the distance to the goal and the probability of solving the cube. To do this I make two changes to the MCTS/PUCT algorithm.
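
For intuition only (my own illustration, not the formula from the code): a value of the form v(s) = gamma ** d(s), where d(s) is the number of moves left to solve and 0 < gamma < 1, can play both roles at once, since it lies in (0, 1] like a probability and decreases monotonically with the distance to the goal:

```python
GAMMA = 0.9  # illustrative constant, not the project's

def value_from_distance(d):
    # in (0, 1] like a probability, and strictly decreasing in the distance
    return GAMMA ** d
```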

2. Neural network architecture

The architecture of the neural network is similar to AlphaGo Zero’s but smaller (only 4 residual blocks and 64 filters for the convolutional layers). (Also, the convolutional part, explained in Convolution trick, is novel.) The network consists of the following layers, sketched below.
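
A rough Keras sketch of that shape (the block and filter counts come from the text; the input encoding, head sizes, and use of plain Conv3D over the full 5 x 5 x 5 grid are my assumptions, and the masking trick of section 3 would replace the plain convolutions):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    y = layers.Conv3D(filters, 3, padding='same')(x)
    y = layers.ReLU()(layers.BatchNormalization()(y))
    y = layers.Conv3D(filters, 3, padding='same')(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(layers.Add()([x, y]))   # skip connection

inputs = layers.Input(shape=(5, 5, 5, 6))        # cube embedded in a 5x5x5 grid
x = layers.Conv3D(64, 3, padding='same')(inputs)
x = layers.ReLU()(layers.BatchNormalization()(x))
for _ in range(4):                               # 4 residual blocks
    x = residual_block(x)
x = layers.Flatten()(x)
policy = layers.Dense(12, activation='softmax', name='policy')(x)  # move probabilities
value = layers.Dense(1, activation='sigmoid', name='value')(x)     # in (0, 1)
model = tf.keras.Model(inputs, [policy, value])
```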

3. Convolution trick

While some sort of convolutional neural network seems to make the most sense, the cube does not fit well into the framework of a 2D convolutional neural network. (If one stacks the six 3 x 3 sides on top of each other, there is no natural way to orient the sides.) Instead, I embedded the 54 squares into a 5 x 5 x 5 array. This can be done naturally, as follows.
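
One way to see the natural embedding (my reconstruction): picture the 3 x 3 x 3 cube occupying the center of a 5 x 5 x 5 grid, so each sticker sits in the grid cell just outside the cubie it is attached to. Those are exactly the cells with one coordinate on the boundary and the other two in the interior:

```python
# The 54 sticker cells of a 5x5x5 grid: exactly one coordinate on the
# boundary (0 or 4), the other two strictly inside (1, 2, or 3).
stickers = [(x, y, z)
            for x in range(5) for y in range(5) for z in range(5)
            if sum(c in (0, 4) for c in (x, y, z)) == 1]
assert len(stickers) == 54   # 3 axes * 2 boundary values * 3 * 3 interior cells
```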

Then I performed a 3D convolution with a 3 x 3 x 3 window and a stride of 1. While it would be possible to just keep using the full 5 x 5 x 5 (x 64 filters) array for each layer, this was computationally prohibitive. Instead I “masked” the array so that it only used the 54 embedded coordinates instead of the full 125. I did this with the following trick, sketched below.
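
Here is my guess at the mechanics (a sketch, not the project's exact code): keep activations as a dense (54, channels) tensor, scatter them into the 5 x 5 x 5 grid only for the convolution itself, then gather the 54 sticker cells back out:

```python
import numpy as np
import tensorflow as tf

# flat indices of the 54 sticker cells (see the previous snippet)
stickers = [(x, y, z) for x in range(5) for y in range(5) for z in range(5)
            if sum(c in (0, 4) for c in (x, y, z)) == 1]
flat = [25 * x + 5 * y + z for x, y, z in stickers]
scatter = tf.constant(np.eye(125, dtype=np.float32)[flat])   # (54, 125) one-hot rows

def masked_conv(x, conv):
    """x: (batch, 54, channels). Scatter into the 5x5x5 grid (zeros in the
    71 unused cells), convolve, and keep only the 54 sticker cells."""
    grid = tf.einsum('bsc,sp->bpc', x, scatter)              # (batch, 125, channels)
    grid = tf.reshape(grid, (-1, 5, 5, 5, x.shape[-1]))
    grid = conv(grid)                                        # e.g. Conv3D(64, 3, padding='same')
    out = tf.reshape(grid, (-1, 125, conv.filters))
    return tf.gather(out, flat, axis=1)                      # back to (batch, 54, filters)
```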

4. Using symmetry

Like AlphaGo Zero, I augmented my training data by symmetries. I use the 48 reflections/rotations of the cube. (I also permute the colors to keep the fixed center squares of each side the same color.) Also, when performing the MCTS, I randomly rotate the input before feeding it into the neural network (adjusting the policy correspondingly).
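
For reference, these 48 symmetries form the full octahedral group: all 3 x 3 signed permutation matrices. A quick way to enumerate them (illustrative; the project applies the symmetries to sticker positions and colors rather than to matrices):

```python
from itertools import permutations, product
import numpy as np

symmetries = []
for perm in permutations(range(3)):           # permute the three axes
    for signs in product((1, -1), repeat=3):  # optionally flip each axis
        m = np.zeros((3, 3), dtype=int)
        for row, (col, s) in enumerate(zip(perm, signs)):
            m[row, col] = s
        symmetries.append(m)
assert len(symmetries) == 48                  # 6 permutations * 8 sign patterns
```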

Further ideas

It seems that there should be some natural neural network solution to the puzzle cube and, moreover, that it could be discovered by reinforcement learning. Here are some ideas to try.

List of neural network puzzle cube solvers

There are a number of projects using neural networks to solve the puzzle cube. Most are very different from mine or don't have good results, but I list them all here to gather them in one place.