Cross-sentence Grammatical Error Correction

This repository contains the code and models for training and testing cross-sentence grammatical error correction (GEC) systems based on convolutional sequence-to-sequence models.

Data Processing:

Training Baseline and CroSent models:

Training, and running rescorer:

For training NUS3 models:

Decoding using pre-trained cross-sentence GEC models

  1. Run to prepare the test datasets.

  2. Download all prerequisite components (BPE model, embeddings, dictionaries, and pre-trained decoder) using the

  3. Download CroSent models using script.

  4. Decode development/test sets with

./ $testset $modelpath $dictdir $optionalgpu

$testset is the test dataset name. The test dataset files must follow the format data/test/$testset/$testset.tok.src (for the input source sentences) and data/test/$testset/$testset.tok.ctx (for the context sentences, i.e. the 2 previous sentences per line).

$modelpath can be either a file, for decoding with a single model, or a directory, for decoding with an ensemble (any model with the name within the specified directory will be used in the ensemble). For a single model, the decoder writes its output files to a directory in the same location as the model path, named after the model path with the prefix "outputs." added. For an ensemble, the decoder writes its output files to the outputs/ directory within $modelpath.

$dictdir is the path to the dictionaries. For the pre-trained models, it is models/dicts.

$optionalgpu is an optional parameter indicating GPU id to run the decoding on (default=0).

  5. Run the reranker using the downloaded weights:
    ./ $outputsdir $testset $weightsfile $optionalgpu

    where $outputsdir is the directory that contains the output of the decoding step and $weightsfile is the path to the trained weights (for the pre-trained weights, it is models/reranker_weights/weights.nucle_dev.txt).
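
The decoding arguments described in step 4 can be sanity-checked before starting a long run. The following is a minimal sketch, not part of the repository: the function names check_testset, expected_outputs_dir, and gpu_or_default are hypothetical. It assumes that the .ctx file holds exactly one context line per source line, and that for a single model the output directory name keeps the model's full file name after the "outputs." prefix. Both assumptions are one reading of the description above and should be verified against the actual scripts.

```shell
# Hypothetical pre-flight helpers; none of these names come from the repository.

# Check that both input files for a test set exist and are line-aligned.
# Assumption: the .ctx file holds one context line per source line.
check_testset() {
    testset=$1
    root=${2:-data/test}   # layout described in step 4
    src="$root/$testset/$testset.tok.src"
    ctx="$root/$testset/$testset.tok.ctx"
    for f in "$src" "$ctx"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            return 1
        fi
    done
    # Arithmetic expansion strips any padding that wc adds on some systems.
    n_src=$(($(wc -l < "$src")))
    n_ctx=$(($(wc -l < "$ctx")))
    if [ "$n_src" -ne "$n_ctx" ]; then
        echo "line count mismatch: $n_src source vs $n_ctx context" >&2
        return 1
    fi
    echo "ok: $testset ($n_src lines)"
}

# Predict where the decoder output should land for a given $modelpath.
# Assumption: for a single model, the full file name (extension included)
# follows the "outputs." prefix.
expected_outputs_dir() {
    modelpath=${1%/}       # drop one trailing slash, if present
    if [ -d "$modelpath" ]; then
        echo "$modelpath/outputs"
    else
        echo "$(dirname "$modelpath")/outputs.$(basename "$modelpath")"
    fi
}

# The GPU id falls back to 0 when the optional argument is omitted.
gpu_or_default() {
    echo "${1:-0}"
}
```

The directory printed by expected_outputs_dir is a candidate value for $outputsdir in the reranker step, provided the assumptions above hold.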

Training from scratch

Data preparation

Download the required datasets and run with the paths to Lang-8 and NUCLE to prepare the datasets.


Download all prerequisite components (BPE model, dictionary files, embeddings, and pre-trained decoder) using the

Each training script train_*.sh has a parameter to specify the random seed value. To train 4 different models, run the training script 4 times, varying the seed value each time (e.g., 1, 2, 3, 4).
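
Running the same script once per seed can be wrapped in a small loop. The sketch below is illustrative only: run_with_seeds and fake_train are hypothetical names, and it assumes the seed is passed as the first command-line argument, which may not match how the train_*.sh scripts actually take their seed parameter.

```shell
# Hypothetical wrapper: run a training command once per seed.
# Assumption: the command accepts the seed as its first argument.
run_with_seeds() {
    train_cmd=$1
    shift
    for seed in "$@"; do
        # Stop at the first failing run instead of continuing silently.
        "$train_cmd" "$seed" || return 1
    done
}

# Stand-in for a real train_*.sh script, so the sketch runs on its own.
fake_train() {
    echo "training with seed $1"
}

run_with_seeds fake_train 1 2 3 4
```

To use it with a real training script, replace fake_train with the script's path once the way it accepts the seed has been confirmed.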

For training the baseline models use script.

For training the CroSent models, use script.

For training the NUS2 model, use script.

For training the NUS3 model

  1. Generate alignments using fast_align (requires fast_align and moses under the tools/ directory), run data/processed
  2. Run script.

For training the reranker:

  1. Decode development dataset using ./ (steps mentioned earlier). Set $outputsdir to the output directory of this decoding step.

  2. Run ./ $outputsdir $devset $optionalgpu


The source code is licensed under GNU GPL 3.0 (see LICENSE) for non-commercial use. For commercial use of this code, separate commercial licensing is also available. Please contact Prof. Hwee Tou Ng (