
Query Segmenter

The QuerySegmenter core library is used to find typed segments within a user query. For example, for the query “Pizza New York”, the segment “New York” can be extracted as a segment of type “city”. The typed segments are matched against a dictionary, which is usually a text file.


Solr Version



Building requires Maven and JDK 8:

$ mvn clean package

Release Notes (2017-10-19) (2017-05-09) (2016-12-09) (2016-06-27)

Core Library


The main interface is QuerySegmenter which contains this method:

List<TypedSegment> segment(String query);

The user query is passed to this method, which returns a list of the typed segments found within the query. Multiple typed segments can be returned. For example, if the query is “car park slope new york”, the method call could return “park slope” as a segment of type neighborhood and “new york” as a city segment. A single segment can also match more than once: for the query “pizza new york”, the method call could return 2 typed segments for “new york”, one as a city and another as a state. It could also return the same type for both, but for different matches in the dictionary. For example, the query “pizza menlo park” could return 2 segments of type neighborhood, one for Menlo Park, CA and another for Menlo Park, NJ. Each type's dictionary decides how many typed segments it returns.

A TypedSegment object has a getMetadata method that returns the metadata of the typed segment as stored in the dictionary. For example, for a location type segment, the metadata could be the latitude and longitude of a rectangle that encompasses the location.

The class QuerySegmenterDefaultImpl is responsible for splitting the user query into multiple segments and asking each dictionary whether it has a match for each segment. This default implementation always parses the user query from left to right and tries the longest possible segment first (the window size is set to 4 in this class). For example, for the query “fast pizza delivery new york”, the segmenter will generate these segments in order:

  1. fast pizza delivery new
  2. fast pizza delivery
  3. fast pizza
  4. fast
  5. pizza delivery new york
  6. pizza delivery new
  7. pizza delivery
  8. pizza
  9. delivery new york
  10. delivery new
  11. delivery
  12. new york
  13. new
  14. york

Each of these segments will be looked up in each dictionary for matches.
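
The enumeration order listed above can be reproduced with a short sketch. This is not the code of QuerySegmenterDefaultImpl; the class name SegmentWindowSketch and the method candidates are made up for illustration, and splitting on whitespace is an assumption about tokenization.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: names and whitespace tokenization are assumptions,
// not the library's actual implementation.
class SegmentWindowSketch {

    // Enumerates candidate segments left to right, longest first,
    // never spanning more than windowSize words.
    static List<String> candidates(String query, int windowSize) {
        List<String> out = new ArrayList<>();
        String trimmed = query.trim();
        if (trimmed.isEmpty()) {
            return out; // nothing to segment
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            int maxEnd = Math.min(words.length, start + windowSize);
            // Shrink the window from the longest span down to a single word.
            for (int end = maxEnd; end > start; end--) {
                out.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints the 14 candidates for the example query, one per line.
        for (String segment : candidates("fast pizza delivery new york", 4)) {
            System.out.println(segment);
        }
    }
}
```

Each candidate produced this way would then be offered to every configured dictionary.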

Segment Dictionary

Generic Segment

A dictionary holds a flat list of words. A case-insensitive lookup is made to retrieve a word from the list. It can be used to look up whether a word is part of a dictionary and act upon that knowledge. For example, it could be used to prefix a word in a query if it is found in a dictionary.

Synonym Segment

This dictionary lists the synonyms of a label. If we have this entry in the dictionary:

New York,nyc,Big Apple

Then "New York" will be returned when "nyc" is looked up in this dictionary.

The first element of the line is the label that will be returned and all other elements on the same line are synonyms. If a lookup is done on the first element, that element is returned. For example, using the dictionary described above, a lookup for 'new york' will return "New York". We can also use a plain list of words without any synonyms. In that case, this dictionary will behave exactly like the Generic Segment Dictionary.
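
The lookup behavior described here can be sketched with a map from every lower-cased term on a line to that line's label. This is a minimal illustration, not the code of SynonymSegmentDictionaryMemImpl; the class SynonymLookupSketch and its methods are hypothetical, and a comma separator is assumed from the example entry.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: class and method names are assumptions,
// not the library's API.
class SynonymLookupSketch {
    private final Map<String, String> labelByTerm = new HashMap<>();

    // Registers one dictionary line: the first element is the label,
    // the remaining elements are its synonyms.
    void addLine(String line) {
        String[] parts = line.split(",");
        String label = parts[0].trim();
        for (String part : parts) {
            // The label itself is registered too, so looking it up returns it.
            labelByTerm.put(part.trim().toLowerCase(Locale.ROOT), label);
        }
    }

    // Case-insensitive lookup; returns null when the term is unknown.
    String lookup(String term) {
        return labelByTerm.get(term.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        SynonymLookupSketch dictionary = new SynonymLookupSketch();
        dictionary.addLine("New York,nyc,Big Apple");
        System.out.println(dictionary.lookup("nyc"));
        System.out.println(dictionary.lookup("new york"));
    }
}
```

A line with a single element registers only the label, which matches the plain word list case.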

Area Segment

An area segment is a location segment that represents a rectangular geographical area. An area segment has a minimum and a maximum latitude and a minimum and a maximum longitude. The dictionary implementation is the class AreaSegmentDictionaryMemImpl, which returns AreaTypedSegment objects.

Here is an example of a file that is read by the area dictionary (it represents neighborhood data of Anchorage, AK):

Old Seward-Oceanview,61.116429,-149.786808,61.040014,-149.899467
Portage Valley,60.906335,-148.740705,60.733033,-149.051696
Glen Alps,61.108623,-149.686223,61.083627,-149.714678
Campbell Park,61.180852,-149.762532,61.166392,-149.860504
Eagle River Valley,61.353984,-149.254768,61.245675,-149.550315
Bear Valley,61.096898,-149.686291,61.045506,-149.7537

The first field is the label of the segment. This label will be used to match a segment in the user query. The other fields are the metadata of each area. For area, the other fields are minlat, minlon, maxlat and maxlon.

Note that this dictionary does a case-insensitive match. If the user query contains “northeast”, it will still match the “Northeast” label defined in the dictionary.

Centroid Segment

A centroid segment is a location defined by a center location (latitude and longitude). Here is a centroid file (it represents the center location of some US cities) used by the CentroidSegmentDictionaryMemImpl class:


The CentroidSegmentDictionaryMemImpl dictionary returns CentroidTypedSegment.

Note that this dictionary does a case-insensitive match. If the user query contains “aaronsburg”, it will still match the “Aaronsburg” label defined in the dictionary.

Solr Integration

The QuerySegmenter Solr library includes Solr components that use the QuerySegmenter core library. It currently contains 2 components: QuerySegmenterQParser and CentroidComponent.

Deployment Library Files.

Copy the QuerySegmenter Solr library jar files (st-QuerySegmenter-core-x.y.z.jar and st-QuerySegmenter-solr-x.y.z.jar) into the lib folder of your Solr core (as defined in solr.xml file).


This QParser is used to retrieve segments from a user query. Any dictionary can be used.

If there is a segment in the user query that matches an element of the dictionary, the query is rewritten using either the label or the location (only for the area segment dictionary). For example, for the query “pizza brooklyn”, if “brooklyn” is an area, the query will be rewritten to “pizza neighborhood:brooklyn” or “pizza location:[minlat,minlon TO maxlat, maxlon]”. The field to use and whether we should use the label or the location is configurable.
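
As a rough illustration of the two rewrite styles, the sketch below replaces a matched segment with either a field:label clause or a field range clause. It is not the QParser's code: the class RewriteSketch, its methods, and the bounding-box numbers are made up for the example, and matching is a plain case-insensitive substring search that ignores word boundaries.

```java
// Illustrative sketch only: names and the sample coordinates are assumptions,
// not the library's implementation or real dictionary data.
class RewriteSketch {

    // Builds a field:label clause, quoting labels that contain a space.
    static String fieldClause(String field, String label) {
        return label.contains(" ") ? field + ":\"" + label + "\"" : field + ":" + label;
    }

    // Builds a field:[minlat,minlon TO maxlat,maxlon] clause.
    static String rangeClause(String field, double minLat, double minLon,
                              double maxLat, double maxLon) {
        return field + ":[" + minLat + "," + minLon + " TO " + maxLat + "," + maxLon + "]";
    }

    // Replaces the first case-insensitive occurrence of segment with clause.
    // Word boundaries are not checked in this simplified version.
    static String rewrite(String query, String segment, String clause) {
        for (int i = 0; i + segment.length() <= query.length(); i++) {
            if (query.regionMatches(true, i, segment, 0, segment.length())) {
                return query.substring(0, i) + clause + query.substring(i + segment.length());
            }
        }
        return query; // no match: leave the query untouched
    }

    public static void main(String[] args) {
        String byLabel = rewrite("pizza brooklyn", "brooklyn",
                fieldClause("neighborhood", "brooklyn"));
        String byLocation = rewrite("pizza brooklyn", "brooklyn",
                rangeClause("location", 40.5, -74.05, 40.75, -73.8));
        System.out.println(byLabel);
        System.out.println(byLocation);
    }
}
```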


The QuerySegmenterQParser needs to be configured in the solrconfig.xml file. Here is an example:

<queryParser name="seg" class="...">
  <lst name="segments">
    <lst name="neighborhood">
      <str name="field">location</str>
      <str name="dictionary">com.sematext.querysegmenter.geolocation.AreaSegmentDictionaryMemImpl</str>
      <str name="filename">${solr.solr.home}/${}/conf/segmenter/neighborhood.txt</str>
      <bool name="useLatLon">true</bool>
    </lst>
    <lst name="authors">
      <str name="field">author</str>
      <str name="dictionary">com.sematext.querysegmenter.GenericSegmentDictionaryMemImpl</str>
      <str name="filename">${solr.solr.home}/${}/conf/segmenter/authors.txt</str>
      <bool name="useLatLon">false</bool>
    </lst>
  </lst>
</queryParser>

This will configure the QuerySegmenterQParser to use an area segment dictionary and a generic segment dictionary. The first dictionary will load the neighborhood.txt file, while the other will load the authors.txt file. If a match is found in a dictionary, the query will be rewritten using the field defined and, in the case of an area, will use the latitude and longitude of the area instead of the label.

It is also possible to use the QParser within a request handler. Here is an example:

<requestHandler name="/segmenter" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="q">{!seg defType=edismax v=$qq}</str>
    <str name="qf">body^2.0 id</str>
  </lst>
</requestHandler>


To use the QParser directly, use LocalParams syntax:


It is also possible to define another QParser to be used for the rest of the query:


In the above example, the Query Segmenter would first find the “new york” typed segment and rewrite the query to pizza city:"new york". The rewritten query would then be handled by the eDismax parser, which would apply the fields defined in its qf only to the “pizza” part. The city:"new york" portion of the query would not be used with qf because of its field-specific prefix.

To use with the request handler defined in the previous section, use parameter dereferencing:



This component works like the QParser described above, but it is implemented as a Solr SearchComponent instead of a QParser. A SearchComponent must be used with a Solr RequestHandler. This specific component must run before the standard query component (or simply be defined as the first component), because it needs to rewrite the query before the query is executed against Solr.


Here is an example configuration (in solrconfig.xml):

<searchComponent name="segmenter" class="...">
  <lst name="segments">
    <lst name="authors">
      <str name="field">author</str>
      <str name="dictionary">com.sematext.querysegmenter.GenericSegmentDictionaryMemImpl</str>
      <str name="filename">${solr.solr.home}/${}/conf/segmenter/authors.txt</str>
      <bool name="useLatLon">false</bool>
    </lst>
    <lst name="types">
      <str name="field">type</str>
      <str name="dictionary">com.sematext.querysegmenter.GenericSegmentDictionaryMemImpl</str>
      <str name="filename">${solr.solr.home}/${}/conf/segmenter/types.txt</str>
      <bool name="useLatLon">false</bool>
    </lst>
    <lst name="projects">
      <str name="field">project</str>
      <str name="dictionary">com.sematext.querysegmenter.GenericSegmentDictionaryMemImpl</str>
      <str name="filename">${solr.solr.home}/${}/conf/segmenter/projects.txt</str>
      <bool name="useLatLon">false</bool>
    </lst>
    <lst name="suffix">
      <str name="field">suffix</str>
      <str name="dictionary">com.sematext.querysegmenter.SynonymSegmentDictionaryMemImpl</str>
      <str name="filename">./solr/collection1/conf/segmenter/suffix.txt</str>
      <bool name="useLatLon">false</bool>
      <bool name="useBoostQuery">true</bool>
    </lst>
  </lst>
</searchComponent>

<requestHandler name="/qs" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Other dismax params... -->
  </lst>
  <arr name="first-components">
    <str>segmenter</str>
  </arr>
</requestHandler>


It can be used like this:


For example, if “solr” is in the dictionary of projects (i.e. in the projects.txt file), the query will be rewritten to “project:solr”.

The component also supports using a boost query instead of a filter. One use case is searching for a person's name with a suffix: the expected behavior is that documents matching the suffix are boosted to a higher relevancy, rather than the results being restricted to suffix matches.

For example, if searching for "John Smith Jr", when useBoostQuery is set to true in the segmenter configuration, the query will be rewritten to bq=suffix:Jr.


This SearchComponent is used to alter the user location if a segment of the query is a centroid. It must be used within a RequestHandler that uses a location filter (bbox or geofilt). If a match is found, the user location (pt request param, which is required) is changed to the center location of the centroid. The effect will be that instead of using the user location for the location filter, it will use the centroid location. If multiple centroid segments are returned from the user query, the closest centroid to the original user location is used.

For example, if a user searches for “pizza Aaronsburg”, the segment “Aaronsburg” could be returned as a centroid with location 40.9068, -77.4081. This location would be used instead of the original location. This would filter the results to keep only the documents around the centroid location.
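
The "closest centroid wins" rule can be sketched as follows. This is not the CentroidComponent code: the class ClosestCentroidSketch is hypothetical, the use of the haversine great-circle distance is an assumption about how closeness is measured, and the second centroid in the usage example is invented.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: names and the haversine distance are assumptions,
// not the library's implementation.
class ClosestCentroidSketch {

    static final class Centroid {
        final String label;
        final double lat;
        final double lon;

        Centroid(String label, double lat, double lon) {
            this.label = label;
            this.lat = lat;
            this.lon = lon;
        }
    }

    // Great-circle distance in kilometers between two lat/lon points.
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * 6371.0 * Math.asin(Math.sqrt(Math.min(1.0, a)));
    }

    // Returns the matched centroid nearest to the user location, or null if none matched.
    static Centroid closest(double userLat, double userLon, List<Centroid> matches) {
        Centroid best = null;
        double bestKm = Double.POSITIVE_INFINITY;
        for (Centroid candidate : matches) {
            double km = haversineKm(userLat, userLon, candidate.lat, candidate.lon);
            if (km < bestKm) {
                bestKm = km;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Centroid> matches = Arrays.asList(
                new Centroid("Aaronsburg", 40.9068, -77.4081),
                new Centroid("Elsewhere", 34.0, -118.0)); // invented second match
        Centroid chosen = closest(40.0, -77.0, matches);
        // The chosen centroid would replace the original pt value.
        System.out.println("pt=" + chosen.lat + "," + chosen.lon);
    }
}
```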


First, we need to define the SearchComponent (in solrconfig.xml) :

<searchComponent name="centroidcomp" class="...">
  <str name="filename">${solr.solr.home}/${}/conf/segmenter/centroid.csv</str>
  <str name="separator">|</str>
</searchComponent>

The “filename” parameter sets the centroid dictionary file. The “separator” parameter specifies the field separator used in the dictionary file (the default is a comma). Only one dictionary can be used.

Next, we need to add this component to a request handler:

<requestHandler name="/centroid" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="sfield">location</str>
    <str name="fq">{!geofilt}</str>
    <str name="q.alt">*:*</str>
    <str name="d">75</str>
  </lst>
  <arr name="first-components">
    <str>centroidcomp</str>
  </arr>
</requestHandler>

It is important to use “first-components” to insert the centroid component in the request handler because it needs to alter the user location before other components of the request handler access it.

Also note that bbox could have been used instead of geofilt.

Another thing to note is the usage of q.alt: with CentroidComponent, it is possible that the original user query will be transformed into an empty string. To handle such cases, define q.alt, which Solr will use instead. In this example, we used the match-all query, which is typical in such cases.


To use it with the request handler defined in the last section:


The pt parameter is the user location. But, if Aberdeen is found to be a centroid segment, the user location will be replaced by the precise centroid location.


QuerySegmenter is released under Apache License, Version 2.0


For any questions ping @sematext