Solr Text Tagger

This project implements a "naive" text tagger based on Apache Lucene / Solr, using Lucene FST (Finite State Transducer) technology under the hood for remarkably low memory use. It is "naive" because it does simple, word-based substring tagging without considering any natural language context. It operates on the output of whatever text analysis you configure in Lucene, so it is flexible enough to match on things like phonetics for sounds-like tagging. For more information, see the presentation video/slides referenced below.

The tagger can be used for finding entities/concepts in large text, or for doing the same in queries to improve query understanding.

For the changes in each version of this tagger, including Solr and Java version compatibility, see CHANGES.md.

Note: the STT is included in Apache Solr 7.4.0!

Solr 7.4.0 now includes the Solr Text Tagger (STT). It's documented in the Solr Reference Guide. As such, you should probably use the one in Solr rather than the one here. That said, htmlOffsetAdjust is not implemented there. Issues #82 and #81 document some of the differences and contain further links.

Resources / References

Pertaining to Lucene's Finite State Transducers:

Contributors:

Quick Start

See the QUICK_START.md file for a set of instructions to get you going ASAP.

Build Instructions

The build requires Java (v8 or v9) and Maven.

To compile and run tests, use:

%> mvn test

To compile, test, and build the jar (placed in target/), use:

%> mvn package

Configuration

A Solr schema.xml needs 2 things: a unique key field, and a name field to tag against, analyzed as described below.

If you want to support typical keyword search on the names, not just tagging, then index the names in an additional field with a typical analysis configuration to your preference.
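As a rough sketch of that two-field arrangement, the schema could declare one field for keyword search and copy it into the tagging field. The field names `id` and `name` and the `text_general` type are assumptions for illustration; `name_tag` matches the request handler defaults shown on this page, and `tag` is the sample field type defined below.

```xml
<!-- Hypothetical field names; adjust to your own schema. -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>

<!-- Regular keyword search on names, analyzed to your preference. -->
<field name="name" type="text_general" indexed="true" stored="true"/>

<!-- Tagging field; only needs to be indexed. -->
<field name="name_tag" type="tag" indexed="true" stored="false"/>

<!-- Feed the same source text into the tagging field. -->
<copyField source="name" dest="name_tag"/>

<uniqueKey>id</uniqueKey>
```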

For tagging, the name field's index analyzer needs to end in either shingling, for "partial" (i.e. sub name phrase) matching of a name, or more likely ConcatenateFilter, for complete name matching. ConcatenateFilter acts similarly to shingling, but it concatenates all tokens into one final token with a space separator. The query-time analysis should have neither shingling nor ConcatenateFilter.

Prior to shingling or the ConcatenateFilter, preceding text analysis should result in consecutive positions (i.e. the position increment of each term must always be 1). As such, synonyms and some configurations of WordDelimiterFilter are not supported. On the other hand, if the input text has a position increment greater than one (e.g. a stop word), then it is handled properly, as if an unknown word were there. The lack of support for synonyms, or any other filters producing posInc=0, is a limitation that was largely overcome in version 1.1, but that work has yet to be ported to 2.x; see Issue #20 regarding the PhraseBuilder.

To make the tagger work as fast as possible, configure the name field with postingsFormat="FST50". In doing so, all the terms/postings are placed into an efficient FST data structure.

Here is a sample field type config that should work quite well:

<fieldType name="tag" class="solr.TextField" positionIncrementGap="100" postingsFormat="FST50"
    omitTermFreqAndPositions="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory" />
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />

    <filter class="org.opensextant.solrtexttagger.ConcatenateFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory" />
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
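For the "partial" matching case mentioned earlier, a variant of the field type might end the index analyzer in a shingle filter instead of ConcatenateFilter. This is an untested sketch, not a recommended configuration: the type name `tag_partial` and the shingle sizes are assumptions, and the right sizes depend on the longest name you expect to match.

```xml
<!-- Sketch only: index analyzer ends in shingling rather than ConcatenateFilter. -->
<fieldType name="tag_partial" class="solr.TextField" positionIncrementGap="100" postingsFormat="FST50"
    omitTermFreqAndPositions="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Assumed sizes: emits single words plus phrases of 2 to 5 words, space separated. -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="5"
        outputUnigrams="true" tokenSeparator=" "/>
  </analyzer>
  <!-- Query-time analysis stays free of shingling and ConcatenateFilter. -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Shingling multiplies the number of indexed terms per name, so expect a larger index than with ConcatenateFilter.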

A Solr solrconfig.xml needs a special request handler, configured like this:

<requestHandler name="/tag" class="org.opensextant.solrtexttagger.TaggerRequestHandler">
  <lst name="defaults">
    <str name="field">name_tag</str>
    <str name="fq">PUT SOME SOLR QUERY HERE; OPTIONAL</str><!-- filter out -->
  </lst>
</requestHandler>

Also, to enable custom so-called postings formats, ensure that your solrconfig.xml has a codecFactory defined like this:

<codecFactory name="CodecFactory" class="solr.SchemaCodecFactory" />

Usage

For tagging, you HTTP POST data to Solr similar to how the ExtractingRequestHandler (Tika) is invoked. A request invoked via the "curl" program could look like this:

curl -XPOST \
  'http://localhost:8983/solr/collection1/tag?overlaps=NO_SUB&tagsLimit=5000&fl=*' \
  -H 'Content-Type:text/plain' -d @/mypath/myfile.txt

The tagger request-time parameters are

Output

The output is broken down into two parts: first an array of tags, and then the Solr documents referenced by those tags. Each tag has the starting character offset, an ending character offset (exclusive, i.e. one past the last matched character), and the Solr unique key field value. The Solr documents part of the response is in Solr's standard search results format.

Advanced Tips