English text is used almost everywhere. It would be ideal if our systems could understand and generate it automatically. However, understanding natural language is a complicated task, so complicated that many researchers have dedicated their entire careers to it.
Many tools for natural language processing have been published. Below are eight that I have collected. I verified that each of them has been used in at least one application, so they are all runnable. Some come from industry, others from research institutes. They provide functions such as parsing, automatic topic discovery, etc.
- OpenNLP: a Java package to do text tokenization, part-of-speech tagging, chunking, etc. (tutorial)
- Stanford Parser: a Java implementation of probabilistic natural language parsers, both highly optimized PCFG* and lexicalized dependency parsers, and a lexicalized PCFG parser
- ScalaNLP: a suite of Scala libraries for natural language processing and machine learning.
- Snowball: a stemmer that supports C and Java.
- MALLET: a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
- JGibbLDA: a Java implementation of Latent Dirichlet Allocation (LDA) using Gibbs sampling
- Apache Lucene Core: a Java search library whose text analyzers can be used for stop-word removal and stemming
- Stanford Topic Modeling Toolbox: topic model training with the CVB0 algorithm, etc.
*PCFG: Probabilistic Context Free Grammar
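To make the preprocessing steps named in the list (tokenization, stop-word removal, stemming) more concrete, here is a minimal sketch in plain Java. It does not use the API of any tool above. The class name `PreprocessSketch`, the tiny stop-word list, and the suffix rules are all my own assumptions for illustration, and the suffix stripping is deliberately crude; it is not the Porter or Snowball algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class PreprocessSketch {
    // Tiny illustrative stop-word list (an assumption); real libraries ship much larger ones.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "is", "are", "of", "to", "and", "in");

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Crude suffix stripping, NOT a real stemming algorithm.
    // A suffix is removed only if at least three characters remain.
    static String stripSuffix(String token) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Full pipeline: tokenize, drop stop words, strip suffixes.
    static List<String> preprocess(String text) {
        List<String> result = new ArrayList<>();
        for (String token : tokenize(text)) {
            if (!STOP_WORDS.contains(token)) {
                result.add(stripSuffix(token));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [parser, pars, tagg, document]
        System.out.println(preprocess("The parsers are parsing the tagged documents."));
    }
}
```

Results such as "pars" and "tagg" show why a dedicated stemmer is worth using: hand-written suffix rules over-strip quickly, while the libraries above handle such cases with far more careful rules.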