Downloading Data

Whether you would like to use our system or only our dataset, the easiest way to do so is to use our script. It is a standalone script whose only dependencies are Python 3.6 and the package click, which can be installed via pip install click.

The following commands can be used to download our dataset, or the datasets we use in either the system or the paper plots. Data will be downloaded to data/external/datasets by default, but this can be changed with the --local-qanta-prefix option.

File Description:


The recommended way to run our system is to use the Anaconda python distribution. The environment.yaml can be used to create a conda environment with all the necessary software versions installed.

The qanta system has the following dependencies; depending on your objective, however, not all are necessary. The Python packages are generally required so that imports resolve, Apache Spark is required for many data preprocessing steps, Vowpal Wabbit is needed only for training a linear model, Spacy is required for preprocessing, Elastic Search is required for the IR-based models, and lz4 and the AWS CLI are necessary for downloading data that is not part of the script.

Installing Apache Spark


You can test whether Spark is installed properly by running something like the following:

> python
Python 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:09:58) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from qanta.spark import create_spark_context
>>> sc = create_spark_context()
Using Spark's default log4j profile: org/apache/spark/
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/07/25 10:04:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/07/25 10:04:01 WARN Utils: Your hostname, hongwu resolves to a loopback address:; using instead (on interface eth0)
17/07/25 10:04:01 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address

Installing Elastic Search 5.6


Install version 5.6.X; do not use 6.X. Also make sure that the bin/ directory within the extracted files is in your $PATH, as it contains the necessary elasticsearch binary.

Installing Python packages

Either use environment.yaml or:

pip install -r packer/requirements.txt

NLTK Models

# Download nltk data
$ python3 download

Qanta on Path

In addition to these steps you need to either run python develop or include the qanta directory in your PYTHONPATH environment variable. We intend to fix path issues in the future by fixing absolute/relative paths.


QANTA configuration is done through a combination of environment variables and the qanta-defaults.yaml/qanta.yaml files. QANTA will read a qanta.yaml first if it exists, otherwise it will fall back to reading qanta-defaults.yaml. This is meant to allow for custom configuration of qanta.yaml after copying it via cp qanta-defaults.yaml qanta.yaml.
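The fallback order described above can be sketched in a few lines of Python. This is only an illustration of the lookup rule, not QANTA's actual loading code; the function name resolve_config_path and its directory parameter are hypothetical.

```python
import os


def resolve_config_path(directory="."):
    """Return the path to qanta.yaml if it exists, else qanta-defaults.yaml.

    Hypothetical helper: it only mirrors the lookup order described in the
    text and does not parse the file.
    """
    custom = os.path.join(directory, "qanta.yaml")
    defaults = os.path.join(directory, "qanta-defaults.yaml")
    # Prefer the user's customized copy; otherwise fall back to the defaults.
    return custom if os.path.exists(custom) else defaults
```

Because the customized file wins whenever it exists, deleting qanta.yaml is enough to return to the shipped defaults.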

The configuration of most interest is how to enable or disable specific guesser implementations. In the guesser config, keys such as qanta.guesser.dan.DanGuesser correspond to the fully qualified path of each guesser. Each of these keys contains an array of configurations (this is signified in yaml by the -). Our code inspects all of these configurations, looks for those that have enabled: true, and runs only those guessers. By default we have enabled: false for all models. If you simply want to perform a sanity check we recommend enabling qanta.guesser.tfidf.TfidfGuesser. If you are looking for our best model and configuration you should enable qanta.guesser.rnn.RnnGuesser.
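To make the selection rule concrete, here is a minimal sketch of how enabled configurations could be picked out of an already-parsed guesser section. The function enabled_guessers and the example dictionary are hypothetical; the real configuration files contain more settings per guesser than shown here.

```python
def enabled_guessers(guesser_config):
    """Return (class_path, config) pairs whose config has enabled set to true."""
    selected = []
    for class_path, configs in guesser_config.items():
        # Each key maps to a list of configurations (the "-" entries in yaml).
        for conf in configs:
            if conf.get("enabled", False):
                selected.append((class_path, conf))
    return selected


# Hypothetical parsed configuration with only the sanity-check guesser enabled.
example = {
    "qanta.guesser.tfidf.TfidfGuesser": [{"enabled": True}],
    "qanta.guesser.dan.DanGuesser": [{"enabled": False}],
    "qanta.guesser.rnn.RnnGuesser": [{"enabled": False}],
}
enabled_paths = [path for path, _ in enabled_guessers(example)]
print(enabled_paths)
```

Since each key holds a list, one guesser class can appear with several configurations and any subset of them can be enabled.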

Running QANTA

Running qanta is managed primarily by two methods: ./ and Luigi. The former is used to run specific commands such as starting/stopping Elastic Search, but in general Luigi is the primary method for running our system.

Luigi Pipelines

Luigi is a pure python make-like framework for running data pipelines. Below we give sample commands for running different parts of our pipeline. In general, you should either append --local-scheduler to all commands or learn about using the Luigi Central Scheduler.

For these common tasks you can use the command luigi --local-scheduler followed by:

Qanta CLI

You can start/stop elastic search with

AWS S3 Checkpoint/Restore

To provide an easy way to version, checkpoint, and restore runs of qanta, we provide a script to manage that. We assume that you set the environment variable QB_AWS_S3_BUCKET to the bucket you want to checkpoint to and restore from. We also assume that we have full access to all the contents of the bucket, so we suggest creating a dedicated bucket.
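As a rough illustration of the environment variable convention, the sketch below reads QB_AWS_S3_BUCKET and fails early when it is missing. The helper checkpoint_bucket is hypothetical and is not taken from the checkpoint script itself.

```python
import os


def checkpoint_bucket(environ=os.environ):
    """Return the bucket named by QB_AWS_S3_BUCKET, or raise if it is unset."""
    bucket = environ.get("QB_AWS_S3_BUCKET")
    if not bucket:
        # Failing early avoids checkpointing to an unintended location.
        raise RuntimeError("Set QB_AWS_S3_BUCKET to the bucket used for checkpoints")
    return bucket
```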

Information on our data sources

Wikipedia Dumps

As part of our ingestion pipeline we access raw Wikipedia dumps. The current code is based on the English Wikipedia dumps created on 2017/04/01.

Of these we use the following (you may need to use more recent dumps)

To process wikipedia we use with the following command:

$ --processes 15 -o parsed-wiki --json enwiki-20170401-pages-articles-multistream.xml.bz2

Do not use the flag to filter disambiguation pages. It uses a simple string regex to check the title and article contents, which introduces both false positives and false negatives. We handle the problem of filtering these out by using the Wikipedia categories dump.

Afterwards we use the following command to tar it and compress it with lz4, and then upload the archive to S3:

tar cvf - parsed-wiki | lz4 - parsed-wiki.tar.lz4

Wikipedia Redirect Mapping Creation

The output of this process is stored in s3://pinafore-us-west-2/public/wiki_redirects.csv

All the wikipedia database dumps are provided in MySQL sql files. This guide has a good explanation of how to install MySQL which is necessary to use SQL dumps. For this task we will need these tables:

To install, prepare MySQL, and read in the Wikipedia SQL dumps execute the following:

  1. Install MySQL with sudo apt-get install mysql-server and sudo mysql_secure_installation
  2. Login with something like mysql --user=root --password=something
  3. Create a database and use it with create database wikipedia; and use wikipedia;
  4. source enwiki-20170401-redirect.sql; (in MySQL session)
  5. source enwiki-20170401-page.sql; (in MySQL session)
  6. This will take quite a long time, so wait it out...
  7. Finally, run the query that fetches the redirect mapping and writes it to a CSV by executing bin/redirect.sql with source bin/redirect.sql. The file will be located at /var/lib/mysql/redirect.csv, which requires sudo access to copy.
  8. The result of that query is a CSV file containing a source page id, source page title, and target page title. This can be interpreted as the source page redirecting to the target page. We filter on namespace=0 to keep only redirects/pages that are main pages and discard things like list/category pages.
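Once the CSV exists, turning it into a lookup table is straightforward. The sketch below assumes a comma-delimited file with no header row and the three columns in the order described in step 8; the function load_redirects and the sample rows are hypothetical, and the real file's quoting may differ.

```python
import csv
import io


def load_redirects(csv_file):
    """Map each source page title to the title it redirects to."""
    redirects = {}
    for row in csv.reader(csv_file):
        if len(row) != 3:
            continue  # skip malformed rows rather than guessing their layout
        _source_id, source_title, target_title = row
        redirects[source_title] = target_title
    return redirects


# Hypothetical rows: source page id, source page title, target page title.
sample = io.StringIO('1,"USA","United_States"\n2,"UK","United_Kingdom"\n')
redirects = load_redirects(sample)
```

For the real file, pass an open file handle instead of the in-memory sample.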

Wikipedia Category Links Creation

The purpose of this step is to use wikipedia category links to filter out disambiguation pages. Every wikipedia page has a list of categories it belongs to. We filter out any pages which have a category which includes the string disambiguation in its name. The output of this process is a json file containing a list of page_ids that correspond to known disambiguation pages. These are then used downstream to filter down to only non-disambiguation wikipedia pages.
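The filtering rule can be sketched as follows. This is not the project's implementation: it assumes a two-column CSV of page id and category name, and it matches the string case-insensitively, which is an assumption beyond what is stated above. The function disambiguation_page_ids and the sample rows are hypothetical.

```python
import csv
import io
import json


def disambiguation_page_ids(csv_file):
    """Return sorted ids of pages with a category containing 'disambiguation'."""
    page_ids = set()
    for row in csv.reader(csv_file):
        page_id, category = row[0], row[1]
        # Assumption: case-insensitive substring match on the category name.
        if "disambiguation" in category.lower():
            page_ids.add(int(page_id))
    return sorted(page_ids)


# Hypothetical rows: page id, category name.
sample = io.StringIO(
    "10,Disambiguation_pages\n"
    "11,American_physicists\n"
    "12,Place_name_disambiguation_pages\n"
)
ids = disambiguation_page_ids(sample)
json_text = json.dumps(ids)  # serialized as a JSON list of page ids
```

Collecting ids in a set keeps a page from being listed twice when several of its categories match.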

The output of this process is stored in s3://pinafore-us-west-2/public/disambiguation_pages.json with the csv also saved at s3://pinafore-us-west-2/public/categorylinks.csv

The process for this is similar to redirects, except that you should instead source a file named similarly to enwiki-20170401-categorylinks.sql, run the script bin/categories.sql, and copy categorylinks.csv. Afterwards run ./ categories disambiguate categorylinks.csv data/external/wikipedia/disambiguation_pages.json. Like the redirects file, this file is automatically downloaded by the pipeline code, so you shouldn't need to worry about this step unless you would like to change it or inspect the results.

SQL References

These references may be useful and are the source for these instructions:

Debugging FAQ and Solutions

pyspark uses the wrong version of python

Set the PYSPARK_PYTHON environment variable to python3.
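If you would rather set this from Python than from the shell, one option is to set the variable in the driver process before the Spark context is created. This is a minimal sketch of that approach; exporting the variable in your shell profile works equally well.

```python
import os

# Set this before creating the Spark context so that the worker
# processes are launched with the python3 interpreter.
os.environ["PYSPARK_PYTHON"] = "python3"
```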

ImportError: No module named 'pyspark'


ValueError: unknown locale: UTF-8

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'

This error appears when Python 3.6 is used with an older Spark release. Python 3.6 needs Spark 2.1.1.

Qanta ID Numbering