This is the code for the SemEval 2019 Task 4, Hyperpartisan News Detection
submitted by team Bertha von Suttner.
All team members belong to the GATE team of the University of Sheffield Natural Language Processing group.
The model created with this code was the winning entry; see the public leaderboard (sort by the accuracy column, descending): https://www.tira.io/task/hyperpartisan-news-detection/dataset/pan19-hyperpartisan-news-detection-by-article-test-dataset-2018-12-07/
A blog article on the GATE blog briefly describes the approach taken.
If you wish to see the code as it was prepared for the SemEval 2019 task, then refer to the semval-2019
tag in the git repo.
Once spaCy is installed, you also need to install its en_core_web_sm model, like this:
python -m spacy download en_core_web_sm
Once NLTK is installed, you also need to install its stopwords data:
python -m nltk.downloader stopwords
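To confirm that both downloads succeeded before starting the longer steps, a small check along the following lines may help. It is only a sketch and not part of the repository: the function name `check_nlp_resources` is made up for illustration, and it assumes that the English stopword list is the one needed.

```python
def check_nlp_resources():
    """Return a dict mapping each resource to True (usable) or False (not usable)."""
    status = {}
    try:
        import spacy
        # spacy.load raises OSError when the model package is not installed.
        spacy.load("en_core_web_sm")
        status["spacy:en_core_web_sm"] = True
    except Exception:  # ImportError, OSError, or a broken installation
        status["spacy:en_core_web_sm"] = False
    try:
        from nltk.corpus import stopwords
        # Accessing the corpus raises LookupError when the data is missing.
        # Assumption: the English list is the one that is needed.
        stopwords.words("english")
        status["nltk:stopwords"] = True
    except Exception:  # ImportError, LookupError, or a broken installation
        status["nltk:stopwords"] = False
    return status


if __name__ == "__main__":
    for name, ok in check_nlp_resources().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

If either entry is reported as missing, rerun the corresponding download command.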
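Preparation steps:
Create a directory elmo and store the ELMo model files in that directory.
Create a directory data and save the by-article training files into it (articles-training-byarticle-20181122.xml and ground-truth-training-byarticle-20181122.xml).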
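The preparation can be verified with a short script before any of the longer steps are started. The following is a minimal sketch rather than part of the repository: `missing_inputs` is a hypothetical helper, the two training file names are taken from the xml2line.py command of the training steps, and because the ELMo model file names are not listed here, the check only tests that the elmo directory is not empty.

```python
from pathlib import Path

# File names as used in the xml2line.py training command.
TRAINING_FILES = (
    "articles-training-byarticle-20181122.xml",
    "ground-truth-training-byarticle-20181122.xml",
)


def missing_inputs(root="."):
    """Return a list of human-readable problems; an empty list means all inputs were found."""
    root = Path(root)
    problems = []
    elmo_dir = root / "elmo"
    # The ELMo file names are not known here, so only check for a non-empty directory.
    if not elmo_dir.is_dir() or not any(elmo_dir.iterdir()):
        problems.append("elmo/ is missing or empty")
    for name in TRAINING_FILES:
        if not (root / "data" / name).is_file():
            problems.append(f"data/{name} not found")
    return problems


if __name__ == "__main__":
    for problem in missing_inputs():
        print("PROBLEM:", problem)
```

Run it from the repository root; no output means the expected inputs are in place.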
Run the following steps:
python Preprocessing/xml2line.py -A data/articles-training-byarticle-20181122.xml -T data/ground-truth-training-byarticle-20181122.xml -F article_sent,title_sent work/train.text.tsv
Convert the text to ELMo embeddings with one of the following two commands, which differ only in the -g option:
python Preprocessing/line2elmo2.py -g -l 100 work/train.text.tsv work/train.elmo.tsv
python Preprocessing/line2elmo2.py -l 100 work/train.text.tsv work/train.elmo.tsv
If you get problems with the GPU memory or RAM, use the -b option to reduce the batch size.
Make sure the directory saved_models does not contain any model files from previous runs:
rm saved_models/*.hdf5
KERAS_BACKEND=tensorflow python CNN_elmo.py work/train.elmo.tsv
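This will create a number of model files in the saved_models directory. The file names contain the validation accuracy.
To apply the models to a test set, preprocess the test XML file ($TESTXMLFILE) in the same way as the training data, running only one of the two line2elmo2.py commands (they differ only in the -g option):
python Preprocessing/xml2line.py -A $TESTXMLFILE -F article_sent,title_sent work/test.text.tsv
python Preprocessing/line2elmo2.py -g -l 100 work/test.text.tsv work/test.elmo.tsv
python Preprocessing/line2elmo2.py -l 100 work/test.text.tsv work/test.elmo.tsv
Then run the ensemble prediction script:
./ensemble_pred.sh work/test.elmo.tsv work/test.preds.txt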
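Because the names of the files in saved_models contain the validation accuracy, the saved models can be ranked without loading them. The sketch below illustrates one way to do that. It is not part of the repository, and it rests on an assumption: the exact naming pattern is not documented here, so the code simply treats the last decimal number in the file name as the accuracy. The helper names and the example file names in the comments are hypothetical; adjust the regular expression to the names that the training step actually produces.

```python
import re
from pathlib import Path

# Assumption: the accuracy is the last decimal number in the file name,
# e.g. a hypothetical "example_0.8438.hdf5".
ACCURACY_RE = re.compile(r"\d+\.\d+")


def accuracy_from_name(path):
    """Extract the assumed validation accuracy from a file name, or None if absent."""
    matches = ACCURACY_RE.findall(Path(path).stem)
    return float(matches[-1]) if matches else None


def rank_models(model_dir="saved_models", top_n=3):
    """Return up to top_n (accuracy, file name) pairs, best accuracy first."""
    scored = []
    for path in Path(model_dir).glob("*.hdf5"):
        accuracy = accuracy_from_name(path)
        if accuracy is not None:
            scored.append((accuracy, path.name))
    scored.sort(reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    for accuracy, name in rank_models():
        print(f"{accuracy:.4f}  {name}")
```

The listing only reports which files scored best on validation; whether and how ensemble_pred.sh selects among them is determined by that script.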