Fact Extraction and VERification

Important Information

This repository requires dependencies that are no longer available on pip and Anaconda. An updated version of this repository has been created for the FEVER2.0 shared task and is available on pip and Docker. For more information, please see this repository: https://github.com/j6mes/fever2-sample

About

This is the PyTorch implementation of the FEVER pipeline baseline described in the NAACL 2018 paper "FEVER: A large-scale dataset for Fact Extraction and VERification".

Unlike other tasks and despite recent interest, research in textual claim verification has been hindered by the lack of large-scale manually annotated datasets. In this paper we introduce a new publicly available dataset for verification against textual sources, FEVER: Fact Extraction and VERification. It consists of 185,441 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims are classified as Supported, Refuted or NotEnoughInfo by annotators achieving 0.6841 in Fleiss κ. For the first two classes, the annotators also recorded the sentence(s) forming the necessary evidence for their judgment. To characterize the challenge of the dataset presented, we develop a pipeline approach using both baseline and state-of-the-art components and compare it to suitably designed oracles. The best accuracy we achieve on labeling a claim accompanied by the correct evidence is 31.87%, while if we ignore the evidence we achieve 50.91%. Thus we believe that FEVER is a challenging testbed that will help stimulate progress on claim verification against textual sources.

The baseline model consists of two components: Evidence Retrieval (DrQA) + Textual Entailment (Decomposable Attention).

Pre-requisites

This was tested and evaluated using the Python 3.6 version of Anaconda 5.0.1, which can be downloaded from anaconda.com.

macOS users may have to install Xcode before running git or installing packages (gcc may fail otherwise). See this post on StackExchange.

Support for Windows operating systems is not provided.

To train the Decomposable Attention models, it is highly recommended to use a GPU. Training will take about 3 hours on a GTX 1080Ti whereas training on a CPU will take days. We offer a pre-trained model.tar.gz that can be downloaded. To use the pretrained model, simply replace any path to a model.tar.gz file with the path to the file you downloaded. (e.g. logs/da_nn_sent/model.tar.gz could become ~/Downloads/model.tar.gz)

Docker Install

Download and run the latest FEVER.

docker volume create fever-data
docker run -it -v fever-data:/fever/data sheffieldnlp/fever-baselines

To enable GPU acceleration, run with --runtime=nvidia once NVIDIA Docker has been installed.

Manual Install

Installation using Docker is preferred. If you are unable to do this, you can manually create the Python environment by following the instructions here: Wiki/Manual-Install

Remember that if you installed manually, you should run source activate fever and cd to the repository directory before you run any commands.

Download Data

Wikipedia

To download a pre-processed Wikipedia dump (license):

bash scripts/download-processed-wiki.sh

Or download the raw data and process it yourself:

bash scripts/download-raw-wiki.sh
bash scripts/process-wiki.sh

Dataset

Download the FEVER dataset from our website into the data directory:

bash scripts/download-data.sh

(Note that if you want to replicate the paper, run scripts/download-paper.sh instead of scripts/download-data.sh.)

Word Embeddings

Download pretrained GloVe vectors:

bash scripts/download-glove.sh

Data Preparation

Sample training data for the NotEnoughInfo class. There are two sampling methods evaluated in the paper: using the nearest neighbour (similarity between TF-IDF vectors) and random sampling.

#Using nearest neighbor method
PYTHONPATH=src python src/scripts/retrieval/document/batch_ir_ns.py --model data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz --count 1 --split train
PYTHONPATH=src python src/scripts/retrieval/document/batch_ir_ns.py --model data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz --count 1 --split dev

Or random sampling

#Using random sampling method
PYTHONPATH=src python src/scripts/dataset/neg_sample_evidence.py data/fever/fever.db
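
The difference between the two sampling methods can be illustrated with a small, self-contained sketch. It is not the code in batch_ir_ns.py or neg_sample_evidence.py: it uses plain unigram TF-IDF rather than the hashed bigram index, it assumes that sampling means picking a page to draw stand-in evidence from, and every function name and page below is hypothetical.

```python
import math
import random
from collections import Counter
from typing import Dict


def build_idf(pages: Dict[str, str]) -> Dict[str, float]:
    """Smoothed inverse document frequency over a {title: text} collection."""
    n_docs = len(pages)
    doc_freq = Counter(
        tok for text in pages.values() for tok in set(text.lower().split())
    )
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def vectorize(text: str, idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TF-IDF vector; tokens unseen in the collection are dropped."""
    counts = Counter(text.lower().split())
    return {tok: tf * idf[tok] for tok, tf in counts.items() if tok in idf}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def nearest_page(claim: str, pages: Dict[str, str]) -> str:
    """Nearest-neighbour sampling: the page most similar to the claim."""
    idf = build_idf(pages)
    claim_vec = vectorize(claim, idf)
    return max(pages, key=lambda title: cosine(claim_vec, vectorize(pages[title], idf)))


def random_page(pages: Dict[str, str], rng: random.Random) -> str:
    """Random sampling: any page, regardless of the claim."""
    return rng.choice(sorted(pages))


# Toy pages, invented for illustration only.
pages = {
    "Paris": "paris is the capital of france",
    "Berlin": "berlin is the capital of germany",
    "Tokyo": "tokyo is the capital of japan",
}
claim = "the eiffel tower is in paris france"
print(nearest_page(claim, pages))
print(random_page(pages, random.Random(0)))
```

Nearest-neighbour sampling tends to produce pages that are topically close to the claim, which makes the NotEnoughInfo examples harder to tell apart from the other classes than randomly drawn pages would be.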

Training

We offer a pretrained model that can be downloaded by running the following command:

bash scripts/download-model.sh

Skip to evaluation if you are using the pretrained model.

Train DA

Train the Decomposable Attention model

#if using a CPU, set
export CUDA_DEVICE=-1

#if using a GPU, set
export CUDA_DEVICE=0 #or cuda device id

Then either train the model with Nearest-Page Sampling for the NEI class:

# Using nearest neighbor sampling method for NotEnoughInfo class (better)
PYTHONPATH=src python src/scripts/rte/da/train_da.py data/fever/fever.db config/fever_nn_ora_sent.json logs/da_nn_sent --cuda-device $CUDA_DEVICE
mkdir -p data/models
cp logs/da_nn_sent/model.tar.gz data/models/decomposable_attention.tar.gz

Or with Random Sampling for the NEI class:

# Using random sampled data for NotEnoughInfo (worse)
PYTHONPATH=src python src/scripts/rte/da/train_da.py data/fever/fever.db config/fever_rs_ora_sent.json logs/da_rs_sent --cuda-device $CUDA_DEVICE
mkdir -p data/models
cp logs/da_rs_sent/model.tar.gz data/models/decomposable_attention.tar.gz

Train MLP

The MLP model can be trained following instructions from the Wiki: Wiki/Train-MLP

Evaluation

These instructions are for the decomposable attention model. The MLP model can be evaluated following instructions from the Wiki: Wiki/Evaluate-MLP

Oracle Evaluation (no evidence retrieval):

Run the oracle evaluation for the Decomposable Attention model on the dev set (requires sampling the NEI class for the dev dataset - see Data Preparation)

PYTHONPATH=src python src/scripts/rte/da/eval_da.py data/fever/fever.db data/models/decomposable_attention.tar.gz data/fever/dev.ns.pages.p1.jsonl

Evidence Retrieval Evaluation:

First retrieve the evidence for the dev/test sets:

#Dev
PYTHONPATH=src python src/scripts/retrieval/ir.py --db data/fever/fever.db --model data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz --in-file data/fever-data/dev.jsonl --out-file data/fever/dev.sentences.p5.s5.jsonl --max-page 5 --max-sent 5

#Test
PYTHONPATH=src python src/scripts/retrieval/ir.py --db data/fever/fever.db --model data/index/fever-tfidf-ngram=2-hash=16777216-tokenizer=simple.npz --in-file data/fever-data/test.jsonl --out-file data/fever/test.sentences.p5.s5.jsonl --max-page 5 --max-sent 5

Then run the model:

#Dev
PYTHONPATH=src python src/scripts/rte/da/eval_da.py data/fever/fever.db data/models/decomposable_attention.tar.gz data/fever/dev.sentences.p5.s5.jsonl --log data/decomposable_attention.dev.log

#Test
PYTHONPATH=src python src/scripts/rte/da/eval_da.py data/fever/fever.db data/models/decomposable_attention.tar.gz data/fever/test.sentences.p5.s5.jsonl --log logs/decomposable_attention.test.log
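
The --max-page and --max-sent flags in the retrieval step suggest a two-stage cut-off: keep the best pages, then keep the best sentences from those pages. The sketch below shows that shape under stated assumptions. It is not the code in ir.py: the token-overlap scorer stands in for the TF-IDF index, and the function names and toy pages are hypothetical.

```python
import heapq
from typing import Callable, Dict, List, Tuple


def overlap(query: str, text: str) -> float:
    """Stand-in scorer: fraction of query tokens that appear in the text."""
    query_tokens = set(query.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & set(text.lower().split())) / len(query_tokens)


def retrieve(
    claim: str,
    pages: Dict[str, List[str]],
    max_page: int = 5,
    max_sent: int = 5,
    score: Callable[[str, str], float] = overlap,
) -> List[Tuple[str, int]]:
    """Return up to max_sent (page title, sentence index) pairs."""
    # Stage 1: keep the max_page pages whose full text scores highest.
    top_pages = heapq.nlargest(
        max_page, pages, key=lambda title: score(claim, " ".join(pages[title]))
    )
    # Stage 2: rank every sentence on the surviving pages and keep max_sent.
    candidates = [
        (title, index) for title in top_pages for index in range(len(pages[title]))
    ]
    return heapq.nlargest(
        max_sent, candidates, key=lambda pair: score(claim, pages[pair[0]][pair[1]])
    )


# Toy pages, invented for illustration only.
toy_pages = {
    "A": ["cats chase mice", "dogs bark loudly"],
    "B": ["the sky is blue", "grass is green"],
    "C": ["cats sleep all day", "mice eat cheese"],
}
print(retrieve("cats chase mice", toy_pages, max_page=2, max_sent=1))
```

With both limits set to 5, as in the commands above, each claim would be paired with at most five candidate evidence sentences.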

Scoring

Score locally (for dev set)

Score:

PYTHONPATH=src python src/scripts/score.py --predicted_labels data/decomposable_attention.dev.log --predicted_evidence data/fever/dev.sentences.p5.s5.jsonl --actual data/fever-data/dev.jsonl
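
The two accuracy figures quoted in the About section (with and without the evidence requirement) can be illustrated as follows. This is a sketch, not the logic of score.py. It assumes that "correct evidence" means the predicted sentences cover at least one complete gold evidence group, that NotEnoughInfo claims need no evidence, and that the label string and dictionary keys are as written below; all of these are assumptions made for the example.

```python
from typing import Iterable, List, Set, Tuple

Sentence = Tuple[str, int]  # (page title, sentence index)
NEI = "NOT ENOUGH INFO"  # assumed label string for the NotEnoughInfo class


def evidence_ok(predicted: Iterable[Sentence], gold_groups: List[Set[Sentence]]) -> bool:
    """True if the prediction contains every sentence of some gold group."""
    predicted_set = set(predicted)
    return any(group <= predicted_set for group in gold_groups)


def fever_style_score(examples: List[dict]) -> Tuple[float, float]:
    """Return (accuracy requiring evidence, label-only accuracy)."""
    strict = label_only = 0
    for example in examples:
        if example["predicted_label"] != example["label"]:
            continue
        label_only += 1
        if example["label"] == NEI or evidence_ok(
            example["predicted_evidence"], example["evidence"]
        ):
            strict += 1
    total = len(examples)
    return strict / total, label_only / total


# Toy predictions, invented for illustration only.
examples = [
    {"label": "SUPPORTS", "predicted_label": "SUPPORTS",
     "evidence": [{("A", 0)}], "predicted_evidence": [("A", 0), ("B", 1)]},
    {"label": "REFUTES", "predicted_label": "REFUTES",
     "evidence": [{("C", 2), ("C", 3)}], "predicted_evidence": [("C", 2)]},
    {"label": NEI, "predicted_label": NEI,
     "evidence": [], "predicted_evidence": []},
    {"label": "SUPPORTS", "predicted_label": "REFUTES",
     "evidence": [{("D", 0)}], "predicted_evidence": [("D", 0)]},
]
print(fever_style_score(examples))
```

In the toy data the second example has the right label but only half of its evidence group, so it counts towards label-only accuracy and not towards the stricter figure.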

Or score on Codalab (for dev/test)

Prepare Submission for Codalab (dev):

PYTHONPATH=src python src/scripts/prepare_submission.py --predicted_labels data/decomposable_attention.dev.log --predicted_evidence data/fever/dev.sentences.p5.s5.jsonl --out_file predictions.jsonl
zip submission.zip predictions.jsonl

Prepare Submission for Codalab (test):

PYTHONPATH=src python src/scripts/prepare_submission.py --predicted_labels logs/decomposable_attention.test.log --predicted_evidence data/fever/test.sentences.p5.s5.jsonl --out_file predictions.jsonl
zip submission.zip predictions.jsonl
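
For readers scripting this step, the packaging half (one JSON object per line in predictions.jsonl, zipped as submission.zip) can be done from Python with the standard library. This sketch does not replace prepare_submission.py; the row keys id, predicted_label and predicted_evidence are assumptions about the expected format, and write_submission is a hypothetical helper.

```python
import json
import zipfile
from pathlib import Path
from typing import Iterable


def write_submission(predictions: Iterable[dict], out_dir: Path) -> Path:
    """Write predictions.jsonl and wrap it in submission.zip."""
    out_dir.mkdir(parents=True, exist_ok=True)
    jsonl_path = out_dir / "predictions.jsonl"
    with jsonl_path.open("w", encoding="utf-8") as handle:
        for row in predictions:
            # One JSON object per line, as the .jsonl extension implies.
            handle.write(json.dumps(row) + "\n")
    zip_path = out_dir / "submission.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # arcname keeps the file at the top level of the archive.
        archive.write(jsonl_path, arcname="predictions.jsonl")
    return zip_path


if __name__ == "__main__":
    # Toy row with assumed keys, invented for illustration only.
    rows = [{"id": 1, "predicted_label": "SUPPORTS", "predicted_evidence": [["A", 0]]}]
    print(write_submission(rows, Path("submission_demo")))
```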