DKPro C4CorpusTools

DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.

Please use the following citation if you use C4Corpus or C4CorpusTools

The full LREC article is available at the UKP website.

Consult the official C4CorpusTools documentation which contains

As of May 2017, thanks to CommonCrawl, the C4Corpus is hosted in their S3 bucket. This makes it much easier to access the data over HTTP (see the documentation).
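
As a rough illustration of HTTP access, the following Python sketch builds an object URL and streams one file to disk using only the standard library. The base URL and object key in the usage comment are placeholders, not the real bucket location, and the helper names `object_url` and `download` are made up for this example; take the actual paths from the documentation.

```python
import shutil
import urllib.request
from urllib.parse import quote


def object_url(base_url: str, key: str) -> str:
    """Join a bucket base URL and an object key into one HTTP URL."""
    # Percent-encode the key but keep "/" so the path structure survives.
    return base_url.rstrip("/") + "/" + quote(key.lstrip("/"))


def download(url: str, destination: str) -> None:
    """Stream the response body to a local file without loading it into memory."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)


# Usage (placeholder values; replace them with the paths given in the documentation):
#   url = object_url("https://example.org/bucket", "path/to/file.warc.gz")
#   download(url, "file.warc.gz")
```

Streaming with `shutil.copyfileobj` keeps memory use flat, which matters if individual corpus files are large.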