WebNav

WebNav is a benchmark task for evaluating an agent's ability to understand natural language and to plan in a partially observed environment. In this challenging task, an agent navigates through a website consisting of web pages and hyperlinks to find the web page in which a given query appears.

WebNav automatically transforms a website into this goal-driven web navigation task. As an example, we built WikiNav, a dataset constructed from the English Wikipedia that contains approximately 5 million articles and more than 12 million queries for training.

With this benchmark, we expect faster progress in developing artificial agents with natural language understanding and planning skills.

Paper: End-to-End Goal-Driven Web Navigation

WikiNav Dataset and Other Files

The WikiNav and WikiNav-Jeopardy datasets and auxiliary files can be downloaded here:

Accessing the Dataset

Due to their large sizes, the Wikipedia articles and queries files are stored in the HDF5 format, which allows fast access without having to load them entirely into memory.

We provide wrapper classes (wiki.py and qp.py) to make your life easier when accessing these files.

For instance, the text and links of the "Machine Learning" article can be accessed using the Python code below (the h5py package is required):

import wiki
wk = wiki.Wiki('path/to/the/wiki.hdf5')

# get a dictionary of <article_title, article_id> pairs
titles_ids = wk.get_titles_pos()

# get the id of the 'Machine learning' article
article_id = titles_ids['Machine learning']

# get the article's text
text = wk.get_article_text(article_id)
print 'text:', text

# get the links in this article as a list of article ids
links = wk.get_article_links(article_id)
print 'links:', links

You can also iterate over all pages:

for i, text in enumerate(wk.get_text_iter()):
    print 'content:', text
    print 'links:', wk.get_article_links(i)
    print 'title:', wk.get_article_title(i)

Sample code to access the queries and their paths, which start from the page 'Category:Main_topic_classifications', is provided below:

import qp
qpp = qp.QP('path/to/the/queries_paths_2hops.hdf5')

# The input to get_queries() is a list containing the names
# of the sets you want to retrieve.
# Valid options are: 'train', 'valid', 'test'.

q_train = qpp.get_queries(['train'])  
p_train = qpp.get_paths(['train'])

print q_train[100] # get the query sample #100
print p_train[100] # get the path for query #100
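
The two wrappers can also be combined, for example, to inspect a path in terms of article titles. The sketch below assumes that each entry in a path is an article id from wiki.hdf5 (as with the link lists above); check the wrapper classes for the exact conventions:

import wiki
import qp

wk = wiki.Wiki('path/to/the/wiki.hdf5')
qpp = qp.QP('path/to/the/queries_paths_2hops.hdf5')

p_train = qpp.get_paths(['train'])

# print the titles of the articles visited along the path of query #100,
# assuming each entry in the path is an article id in wiki.hdf5
for article_id in p_train[100]:
    print 'hop:', wk.get_article_title(article_id)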

Creating Your Dataset

If you want to generate your own dataset from a website, run 'create_dataset.py'. Don't forget to change the paths inside the parameters.py file to point to where you want to save the dataset.

The file 'wiki_parser.py' contains the code to parse a Wikipedia article from the dump file, given its title. If you want to parse a particular website, you will have to create your own parser method. The only input to the parse() function is the URL of the web page, and it must return the content (as a string) and a list of hyperlinks inside the page.

A simple example of a parser class with no text cleaning is provided in simple_parser.py.
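
For illustration, here is a rough sketch of what such a parser could look like, using only the Python standard library and no text cleaning. The class and variable names below are ours and purely illustrative; see simple_parser.py for the actual implementation:

import urllib2
from HTMLParser import HTMLParser

class PageParser(HTMLParser):
    # collects the visible text and the hyperlink targets of an HTML page
    def __init__(self):
        HTMLParser.__init__(self)
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)

def parse(url):
    # returns the page content as a single string and the list of hyperlinks
    html = urllib2.urlopen(url).read()
    parser = PageParser()
    parser.feed(html)
    content = ' '.join(part.strip() for part in parser.text_parts if part.strip())
    return content, parser.links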

Running the Model

After changing the properties in the parameters.py file to point to your local paths, the model can be trained using the following command:

THEANO_FLAGS='floatX=float32' python neuagent.py

If you want to use a GPU:

THEANO_FLAGS='floatX=float32,device=gpu0,nvcc.fastmath=True' python neuagent.py

Dependencies

To run the code, you will need, at a minimum, Python with the Theano and h5py packages (Theano is used to train the model, and h5py is used to read the HDF5 dataset files).

We recommend that you have at least 32GB of RAM. If you are going to use a GPU, the card must have at least 6GB of memory.

Reference

If you use this code as part of any published research, please acknowledge the following paper:

"End-to-End Goal-Driven Web Navigation"

Rodrigo Nogueira and Kyunghyun Cho

@inproceedings{nogueira2016end,
  title={End-to-End Goal-Driven Web Navigation},
  author={Nogueira, Rodrigo and Cho, Kyunghyun},
  booktitle={Advances in Neural Information Processing Systems},
  pages={1903--1911},
  year={2016}
}

License

Copyright (c) 2016, Rodrigo Nogueira and Kyunghyun Cho

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.