07 Apr 2018: JATE 2.0 Beta.11 released. The main changes include: 1) migration to Solr 7.2.1. WARNING: the index files created by this version of Solr are not compatible with previous versions; 2) fixes for a couple of minor bugs documented in the Issues page; 3) two more example configurations for the TTC corpora; 4) two new algorithms, Basic and ComboBasic; 5) an improved introduction page.
02 Apr 2018: JATE 2.0 Beta.9 released. The main change is migration to Solr 6.6.0 (thanks to MysterionRise). WARNING: the index files created by this version of Solr are not compatible with previous versions. Please consider this before upgrading!
JATE (Java Automatic Term Extraction) is an open source library for Automatic Term Extraction (or Recognition) from text corpora. It is implemented within the Apache Solr framework (currently Solr 7.2.1) and supports more than 10 ATE algorithms and almost any kind of term candidate pattern. The integration with Solr gives JATE the potential to be easily customised and adapted to different document formats, domains, and languages.
JATE is not just a library for ATE. It also implements several text processing utilities that can easily be used for other general-purpose indexing, such as tokenisation and advanced phrase and n-gram extraction. See Reasons for using JATE.
Please support us by citing JATE as below:
If you use the version from this Git repository: Zhang, Z., Gao, J., Ciravegna, F. 2016. JATE 2.0: Java Automatic Term Extraction with Apache Solr. In The Proceedings of the 10th Language Resources and Evaluation Conference, May 2016, Portorož, Slovenia
If you use the old JATE 1.11 available here (no longer supported, apart from an outdated JATE 1.0 wiki page): Zhang, Z., Iria, J., Brewster, C., and Ciravegna, F. 2008. A Comparative Evaluation of Term Recognition Algorithms. In Proceedings of The 6th Language Resources and Evaluation Conference, May 2008, Marrakech, Morocco.
A wide range of ATE tools and libraries have been developed over the years. In comparison, there are five reasons why JATE is unique:
For terminology practitioners, this means you can quickly build highly customisable ATE tools that suit your data and domain, at no cost. For terminology researchers and developers, this means that you have many necessary building blocks for developing novel ATE methods, and a uniform environment where you can evaluate and compare different methods. For general information retrieval users, you have a range of advanced text processing utilities that you can easily plug into your existing Solr or Lucene based indexing and retrieval applications.
JATE is currently maintained by a team of two members, who have other full-time roles but spend as much of their spare time as possible on this work. We try our best to respond to your queries, but we apologise for any delays this may cause. However, there are many ways you can contribute to JATE to make it better. Currently you can obtain support from us in the following ways:
JATE is research software that originates from an EPSRC funded project, 'Abraxas'. Since the project ended, there has been no further funding to support the software, so all subsequent development and current maintenance have been undertaken voluntarily by the team. JATE is far from perfect, and yet we are thrilled to see it becoming one of the most popular free text mining tools in the community, thanks to your support. We are also keen to make it better and would therefore be grateful for your contributions in many forms:
We would be grateful if you could tell us a little more about your use cases with JATE: are you using JATE to conduct cutting-edge research in another (or the same) subject area? Or are you using JATE to enable your business applications? By gathering as many detailed use cases as possible, you are helping us make a compelling case when applying for funding from various institutions to support the development and maintenance of JATE. Please get in touch with us by email and share your story with us - it costs you no money, just a little of your time!
We are keen to collaborate with any partners (academia or industry) to develop new project ideas. This can be, but is not limited to, any of the following:
We welcome bug fixes, improvements, new features etc. Before embarking on making significant changes, please open an issue and ask first so that you do not risk duplicating efforts or spending time working on something that may be out of scope. To contribute code, please follow:
$ git clone git@github.com:<your-username>/jate.git
$ cd jate
$ git remote add upstream https://github.com/ziqizhang/jate
$ git checkout master
$ git fetch upstream
$ git merge upstream/master
$ git checkout -b <feature-branch-name>
$ git commit -m "Issue #<issue-number> - <commit-message>"
$ git push origin <feature-branch-name>
Important: By submitting a patch, you agree to allow the project owners to license your work under the LGPLv3 license.
A crucial resource for developing ATE methods is data, and particularly 'annotated' data that consists of text corpora together with a list of the expected 'real' terms to be found within those corpora. We call this a 'gold standard'. It is critical for evaluating and improving the performance of ATE in particular domains.
If you would like to share any data you have created, please also get in touch by email. We will credit you and share a download within the Other downloads section, subject to your consent.
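To illustrate how a gold standard is typically used, the following is a minimal sketch of scoring a ranked list of extracted term candidates against a list of expected terms using precision and recall at the top K. It is not part of JATE's API: the class name `GoldStandardEval`, its methods, and the sample terms are all hypothetical, and the matching rule (exact match after trimming and lower-casing) is an assumption; real evaluations often also lemmatise terms.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not a JATE class.
public class GoldStandardEval {

    // Assumed matching rule: compare terms after trimming and lower-casing.
    static String normalise(String term) {
        return term.trim().toLowerCase(Locale.ROOT);
    }

    static Set<String> normaliseAll(Iterable<String> terms) {
        Set<String> out = new HashSet<>();
        for (String t : terms) {
            out.add(normalise(t));
        }
        return out;
    }

    // Number of distinct gold-standard terms found among the top-k candidates.
    static int hitsAtK(List<String> ranked, Set<String> gold, int k) {
        Set<String> goldNorm = normaliseAll(gold);
        Set<String> found = new HashSet<>();
        int limit = Math.min(k, ranked.size());
        for (int i = 0; i < limit; i++) {
            String candidate = normalise(ranked.get(i));
            if (goldNorm.contains(candidate)) {
                found.add(candidate);
            }
        }
        return found.size();
    }

    // Fraction of the top-k candidates that are gold-standard terms.
    static double precisionAtK(List<String> ranked, Set<String> gold, int k) {
        int limit = Math.min(k, ranked.size());
        return limit == 0 ? 0.0 : (double) hitsAtK(ranked, gold, k) / limit;
    }

    // Fraction of gold-standard terms recovered within the top-k candidates.
    static double recallAtK(List<String> ranked, Set<String> gold, int k) {
        int goldSize = normaliseAll(gold).size();
        return goldSize == 0 ? 0.0 : (double) hitsAtK(ranked, gold, k) / goldSize;
    }

    public static void main(String[] args) {
        // Made-up sample data: candidates in ranked order, and expected terms.
        List<String> ranked = Arrays.asList(
                "cell cycle", "the results", "Protein Kinase", "figure");
        Set<String> gold = new HashSet<>(Arrays.asList(
                "cell cycle", "protein kinase", "gene expression"));

        System.out.printf(Locale.ROOT, "P@4 = %.2f%n", precisionAtK(ranked, gold, 4));
        System.out.printf(Locale.ROOT, "R@4 = %.2f%n", recallAtK(ranked, gold, 4));
    }
}
```

With this sample data, two of the four candidates match the gold standard, so precision at 4 is 0.50, and two of the three expected terms are found, so recall at 4 is 0.67. A larger or domain-specific gold standard makes such scores more meaningful when comparing ATE algorithms.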
This Git repository only hosts the most recent version of JATE. You can obtain some of the previous versions below:
We share datasets used for the development and evaluation of ATE below.
The team members' personal webpages contain their email contacts: