Gradient-based hyperparameter optimization package with TensorFlow

NOTE: this package is discontinued. Please refer to the new package FAR-HO at https://github.com/lucfra/FAR-HO

FAR-HO is a more complete and easier-to-use version; see its repository for the main differences and new features. This package will not be updated in the future but will be maintained for reproducibility of the experiments in the paper.

The package implements the three algorithms presented in the paper Forward and Reverse Gradient-Based Hyperparameter Optimization (http://proceedings.mlr.press/v70/franceschi17a).

The first two algorithms compute, with different procedures, the gradient of a validation error with respect to the hyperparameters - i.e. the hypergradient - while the last, based on Forward-HG, performs "real time" (i.e. at training time) hyperparameter updates.

![alt text](https://github.com/lucfra/RFHO/blob/master/rfho/examples/0_95_crop.png "Response surface of a small ANN and optimization trajectory in the hyperparameter space. The arrows depict the negative hypergradient at the current point, computed with the Forward-HG algorithm.")
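To make the notion of hypergradient concrete, here is a minimal pure-Python sketch on a toy problem. It does not use RFHO or TensorFlow: the scalar model, the quadratic losses and all function names are assumptions made for illustration. The learning rate is the only hyperparameter, and the forward-mode recursion is checked against a finite-difference estimate.

```python
# Toy setting (assumed for illustration, not part of RFHO):
#   training error   J(w) = 0.5 * (w - target)**2
#   validation error E(w) = 0.5 * (w - val_target)**2
#   hyperparameter   eta  = learning rate of plain gradient descent


def validation_error(w, val_target=2.5):
    return 0.5 * (w - val_target) ** 2


def train(eta, w0=0.0, target=3.0, steps=20):
    """Run `steps` iterations of gradient descent on J and return the final w."""
    w = w0
    for _ in range(steps):
        w = w - eta * (w - target)
    return w


def forward_hypergradient(eta, w0=0.0, target=3.0, val_target=2.5, steps=20):
    """Propagate z = dw/d(eta) alongside the training iterates."""
    w, z = w0, 0.0
    for _ in range(steps):
        grad = w - target
        # Differentiate the update w - eta * grad with respect to eta:
        # (1 - eta) comes from the dependence on the previous w, -grad is direct.
        z = (1.0 - eta) * z - grad
        w = w - eta * grad
    # Chain rule: dE/d(eta) = E'(w_T) * dw_T/d(eta)
    return (w - val_target) * z


def finite_difference(eta, h=1e-6):
    """Central finite-difference estimate of the same hypergradient."""
    up = validation_error(train(eta + h))
    down = validation_error(train(eta - h))
    return (up - down) / (2.0 * h)


hg = forward_hypergradient(0.1)
fd = finite_difference(0.1)
print(hg, fd)  # the two estimates should agree closely
```

The finite-difference check needs two full training runs per hyperparameter and is only an approximation; the forward recursion obtains the derivative in a single run.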

Installation & Dependencies

Clone the repository and run the setup script.

git clone https://github.com/lucfra/RFHO.git
cd RFHO
python setup.py install

Besides the "usual" modules (numpy, pickle, gzip), RFHO depends on tensorflow. Some secondary modules also depend on cvxopt (projections) and intervaltree. The core code works without these two packages, so feel free to ignore these requirements.

Please note that required packages will not be installed automatically.
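Since nothing is installed automatically, a quick check of the environment can save some time. The snippet below is only a convenience sketch and is not shipped with the package: `is_installed` and the `REQUIRED` / `OPTIONAL` groupings are names introduced here, based on the dependency list above.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Grouping assumed from the dependency list in this README.
REQUIRED = ("numpy", "tensorflow")
OPTIONAL = ("cvxopt", "intervaltree")

status = {name: is_installed(name) for name in REQUIRED + OPTIONAL}
for name, found in status.items():
    kind = "required" if name in REQUIRED else "optional"
    print(f"{name:<14} {kind:<9} {'found' if found else 'MISSING'}")
```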


The aim of this package is to implement and develop gradient-based hyperparameter optimization (HO) techniques in TensorFlow, thus making them readily applicable to deep learning systems. The package is under development and at the moment the code is not particularly optimized; please feel free to open issues with comments, suggestions and feedback! You can email me at luca.franceschi@iit.it.

Quick Start

Core Steps

import rfho as rf
import tensorflow as tf

model = create_model(...)  
w, out = rf.vectorize_model(model.var_list, model.out)

lambda1 = tf.Variable(...)
lambda2 = tf.Variable(...)
training_error = J(w, lambda1, lambda2)
validation_error = f(w)

lr = tf.Variable(...)  # learning rate, itself optimized as a hyperparameter
training_dynamics = rf.GradientDescentOptimizer.create(w, lr=lr, loss=training_error)

hyper_dict = {validation_error: [lambda1, lambda2, lr]}
hyper_opt = rf.HyperOptimizer(training_dynamics, hyper_dict, method=rf.ForwardHG)

hyper_batch_size = 100
with tf.Session().as_default():
    hyper_opt.initialize()  # initializing just once corresponds to RTHO algorithm
    for k in range(...):
        hyper_opt.run(hyper_batch_size, ....)  

1. This is gradient-based optimization, and computing the hypergradients involves second-order derivatives of the training error (even though no Hessian is explicitly computed at any time); therefore, all the ops used in the model must have a second-order derivative registered in TensorFlow.

2. For the hypergradients to make sense, hyperparameters must be real-valued. Moreover, while ReverseHG should handle generic rank-r tensor hyperparameters (tested on scalars, vectors and matrices), in ForwardHG hyperparameters should be scalars or vectors.

Which Algorithm Do I Choose?

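Forward-HG and Reverse-HG compute the same hypergradient, so the choice is a matter of time versus memory: Reverse-HG has to store the whole training trajectory, while the running time of Forward-HG grows with the number of hyperparameters.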
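The trade-off can be seen in a small pure-Python sketch on a toy problem. Everything here is an assumption made for illustration (scalar model, L2 coefficient `lam` as the only hyperparameter, function names); it is not the package's implementation. The forward version carries one extra variable per hyperparameter during training, while the reverse version keeps every iterate in memory and then sweeps backwards.

```python
# Toy setting (assumed for illustration, not part of RFHO):
#   training error   J(w, lam) = 0.5 * (w - target)**2 + 0.5 * lam * w**2
#   validation error E(w)      = 0.5 * (w - val_target)**2


def dynamics(w, lam, eta=0.1, target=3.0):
    """One gradient-descent step on the training error."""
    return w - eta * ((w - target) + lam * w)


def partials(w, lam, eta=0.1):
    """Partial derivatives of `dynamics` with respect to w and to lam."""
    return 1.0 - eta * (1.0 + lam), -eta * w


def forward_hg(lam, w0=0.0, steps=50, val_target=2.0):
    """Forward mode: constant memory in the number of steps."""
    w, z = w0, 0.0  # z tracks dw/d(lam)
    for _ in range(steps):
        a, b = partials(w, lam)
        z = a * z + b
        w = dynamics(w, lam)
    return (w - val_target) * z


def reverse_hg(lam, w0=0.0, steps=50, val_target=2.0):
    """Reverse mode: stores the whole trajectory, then backpropagates through it."""
    trajectory = [w0]
    for _ in range(steps):
        trajectory.append(dynamics(trajectory[-1], lam))
    alpha = trajectory[-1] - val_target  # dE/dw at the final iterate
    hypergrad = 0.0
    for w_prev in reversed(trajectory[:-1]):
        a, b = partials(w_prev, lam)
        hypergrad += alpha * b  # direct effect of lam on this step
        alpha = alpha * a       # push the adjoint one step back in time
    return hypergrad


print(forward_hg(0.3), reverse_hg(0.3))  # same value up to rounding
```

With many hyperparameters, `z` would become a matrix with one column per hyperparameter, which is where the extra time of the forward mode comes from.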


The real-time version (RTHO), which is based on Forward-HG, can dramatically speed up the optimization.
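As a rough illustration of the real-time idea, the sketch below updates the hyperparameter at every training step using the running forward-mode derivative, instead of waiting for training to finish. It is a toy in pure Python under assumed settings (scalar model, L2 coefficient `lam`, hand-picked step sizes, hypothetical function name), not the package's RTHO code.

```python
# Toy setting (assumed for illustration, not part of RFHO):
#   training error   J(w, lam) = 0.5 * (w - target)**2 + 0.5 * lam * w**2
#   validation error E(w)      = 0.5 * (w - val_target)**2
# With target = 3 and val_target = 2, the training minimizer 3 / (1 + lam)
# matches the validation target when lam = 0.5.


def real_time_ho(lam=0.0, w=0.0, eta=0.1, hyper_lr=0.01, steps=3000,
                 target=3.0, val_target=2.0):
    z = 0.0  # running estimate of dw/d(lam); never reset during training
    for _ in range(steps):
        # Forward-mode update of z, using the parameters before the step.
        z = (1.0 - eta * (1.0 + lam)) * z - eta * w
        # Ordinary training step on the parameters.
        w = w - eta * ((w - target) + lam * w)
        # Hyperparameter step along the current (partial) hypergradient.
        lam = lam - hyper_lr * (w - val_target) * z
    return lam, w


lam, w = real_time_ho()
print(lam, w)
```

Because the hyperparameter moves while `z` is still being accumulated, the quantity used in the update is only an approximation of the true hypergradient; in this toy case the iterates still settle where the validation error is minimized.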

The Idea Behind

The objective is to minimize some validation function E with respect to a vector of hyperparameters lambda. The validation error depends on the model output and thus on the model parameters w. Ideally, w is a minimizer of the training error, so the hyperparameter optimization problem can be naturally formulated as a bilevel optimization problem.
Since these problems are rather hard to tackle, we
explicitly take into account the learning dynamics used to obtain the model
parameters (e.g. you can think about stochastic gradient descent with momentum), and we formulate HO as a constrained optimization problem. See the paper for details.
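
In symbols, the two formulations can be written roughly as follows. E, lambda and w are as above and J is the training error from the Quick Start; Phi_t (the training dynamics at step t) and T (the number of training steps) are names introduced here. For optimizers such as gradient descent with momentum, w_t is assumed to denote the full optimizer state, not just the model weights.

```latex
% Bilevel formulation
\min_{\lambda} \; E(w_{\lambda})
\quad \text{s.t.} \quad
w_{\lambda} \in \arg\min_{w} J(w, \lambda)

% Constrained formulation with explicit learning dynamics
\min_{\lambda,\, w_1, \dots, w_T} \; E(w_T)
\quad \text{s.t.} \quad
w_t = \Phi_t(w_{t-1}, \lambda), \qquad t = 1, \dots, T
```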

Code Structure


If you use this, please cite the paper.