This is an implementation of an algorithm discussed in Ganin et al. (2015), Glorot et al. (2011), and Ben-David et al. (2007). It has been adapted for use with machine translation datasets and is released to the public under the MIT license.
This algorithm computes the Proxy A-Distance (PAD) between two domain distributions. PAD is a measure of similarity between datasets from different domains (e.g. newspapers and talk shows). The lower the PAD, the closer two datasets are.
The algorithm is as follows:
Train a classifier to tell the two domains apart, then measure its error e on a held-out test set.
PAD = 2 (1 − 2e)
We use a linear bag-of-words SVM for the underlying classifier.
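
As a rough illustration of the procedure, the sketch below estimates PAD with scikit-learn. It is not the code in main.py: the function name `proxy_a_distance`, whitespace tokenization, binary word features, the 80/20 train/test split, and `C=1.0` are all assumptions made for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC


def proxy_a_distance(source_sents, target_sents, vocab=None, seed=0):
    """Estimate PAD between two lists of sentences (hypothetical helper)."""
    sentences = list(source_sents) + list(target_sents)
    # Domain labels: 0 for the first corpus, 1 for the second.
    labels = np.array([0] * len(source_sents) + [1] * len(target_sents))

    # Bag-of-words features. Splitting on whitespace is an assumption;
    # if vocab is given, only those tokens become features.
    vectorizer = CountVectorizer(vocabulary=vocab, binary=True, analyzer=str.split)
    features = vectorizer.fit_transform(sentences)

    # Hold out 20% of the mixed data to measure the domain classifier's error.
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=seed, stratify=labels
    )

    classifier = LinearSVC(C=1.0)
    classifier.fit(x_train, y_train)

    # e is the fraction of held-out sentences assigned to the wrong domain.
    error = 1.0 - classifier.score(x_test, y_test)
    return 2.0 * (1.0 - 2.0 * error)
```

With this formula, a classifier that separates the domains perfectly (e = 0) gives PAD = 2, and one that does no better than chance (e = 0.5) gives PAD = 0.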
pip install numpy
pip install scikit-learn
python main.py [corpusfile 1] [corpusfile 2] [vocab file]
corpusfile 1 is a text file with one sentence per line.
corpusfile 2 is another text file with one sentence per line.
vocab is a text file with one token per line. These tokens represent a shared vocabulary for the above corpusfiles.

python main.py test_data/europarl.en test_data/europarl.fr test_data/opensubtitles.en test_data/opensubtitles.fr test_data/vocab
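
For readers preparing their own inputs, here is a minimal sketch of reading files in the formats described above. The helper name `read_lines`, the UTF-8 encoding, and the choice to skip blank lines are assumptions for illustration; main.py may load its inputs differently.

```python
from pathlib import Path


def read_lines(path):
    """Return the non-empty lines of a text file, stripped of surrounding whitespace."""
    # UTF-8 is assumed here; adjust the encoding to match your data.
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

The same helper would cover both kinds of input: a corpus file yields one sentence per list entry, and a vocab file yields one token per list entry.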