Comparable documents miner: a toolkit for Arabic-English morphological analysis, text processing, n-gram feature extraction, POS tagging, dictionary translation, document alignment, corpus information, text classification, tf-idf computation, text similarity computation, HTML document cleaning, and more.
This software was implemented by Motaz SAAD (motaz dot saad at gmail dot com) during his PhD work. The PhD thesis is available at: https://sites.google.com/site/motazsite/Home/publications/saad_phd.pdf
Motaz Saad. Mining Documents and Sentiments in Cross-lingual Context. PhD thesis, Université de Lorraine, January 2015.
This software processes Arabic and English text. To use this software, load it as follows:
    import imp
    tp = imp.load_source('textpro', 'textpro.py')
    # Then, you can use functions as follows:
    clean_text = tp.process_text(text)
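Note that the `imp` module has been deprecated since Python 3.4 and was removed in Python 3.12. On newer interpreters, a minimal sketch of an equivalent loader built on `importlib` could look like the following. The helper name `load_source` is an assumption made for illustration and is not part of this software.

```python
import importlib.util


def load_source(module_name, path):
    """Load a Python source file as a module (stand-in for imp.load_source)."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError("cannot load %r from %r" % (module_name, path))
    module = importlib.util.module_from_spec(spec)
    # Run the file's top-level code inside the new module object.
    spec.loader.exec_module(module)
    return module


# Usage, assuming textpro.py is in the current directory:
# tp = load_source('textpro', 'textpro.py')
# clean_text = tp.process_text(text)
```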
This software depends on the following Python packages: scipy, numpy, nltk, sklearn, and bs4. Please make sure that they are installed before using this software.
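A quick way to check for these dependencies before loading the software is sketched below. The function name `missing_packages` is hypothetical. The names listed are import names. On PyPI, `sklearn` is distributed as `scikit-learn` and `bs4` as `beautifulsoup4`.

```python
import importlib.util

# Import names of the packages this README lists as dependencies.
REQUIRED = ["scipy", "numpy", "nltk", "sklearn", "bs4"]


def missing_packages(names=REQUIRED):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Usage:
# missing = missing_packages()
# if missing:
#     print("Please install:", ", ".join(missing))
```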
This software uses the following resources:
Arabic stopwords: http://www.ranks.nl/stopwords/arabic
Open Multilingual WordNet (OMW) dictionaries http://compling.hss.ntu.edu.sg/omw/ The references of OMW are listed below:
ISRI Arabic Stemmer, which is a rooting algorithm for Arabic text. The reference of ISRI Arabic Stemmer is below:
This software modifies the ISRI Arabic Stemmer to perform light stemming for Arabic words.
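As a rough illustration of what light stemming on top of ISRI can look like, the sketch below keeps only the normalization and affix-stripping steps of NLTK's `ISRIStemmer` and skips root extraction. The function name `light_stem` is hypothetical, and this is not necessarily how this software implements its modification.

```python
from nltk.stem.isri import ISRIStemmer

_stemmer = ISRIStemmer()


def light_stem(word):
    """Strip common Arabic prefixes and suffixes without reducing to a root."""
    word = _stemmer.norm(word, num=1)   # remove diacritics (short vowels)
    if word in _stemmer.stop_words:
        return word                     # leave stop words untouched
    word = _stemmer.pre32(word)         # remove 3- and 2-letter prefixes
    word = _stemmer.suf32(word)         # remove 3- and 2-letter suffixes
    word = _stemmer.waw(word)           # remove a leading connective waw
    word = _stemmer.norm(word, num=2)   # normalize initial hamza to bare alif
    return word
```

The full `ISRIStemmer.stem` method continues from this point with pattern matching to extract a root, which is the step a light stemmer leaves out.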
To translate text with the dictionaries, run dict-demo.py as follows:
    python dict-demo.py <inputfile> <outputfile> <source language>
    # translate from Arabic to English
    python dict-demo.py test-text-files/dict-test-ar-input.txt test-text-files/dict-out.txt ar
    # translate from English to Arabic
    python dict-demo.py test-text-files/dict-test-en-input.txt test-text-files/dict--out.txt en
To run the Arabic morphological analysis, use arabic-morphological-analysis-demo.py as follows:
    python arabic-morphological-analysis-demo.py <inputfile> <outputfile>
    python arabic-morphological-analysis-demo.py test-text-files/test-in.ar.txt test-text-files/test-out.ar.txt