Elasticsearch Analysis Synonym

Overview

Elasticsearch Analysis Synonym Plugin provides NGramSynonymTokenizer. For more details, see LUCENE-5252.

Version

Versions in Maven Repository

Issues/Questions

Please file an issue. (Japanese forum is here.)

Installation

For 5.x

$ $ES_HOME/bin/elasticsearch-plugin install org.codelibs:elasticsearch-analysis-synonym:5.3.0

For 2.x

$ $ES_HOME/bin/plugin install org.codelibs/elasticsearch-analysis-synonym/2.4.0

Getting Started

Create synonym.txt File

First, create a synonym dictionary file, synonym.txt, in $ES_CONF (e.g. /etc/elasticsearch). The following content is just a sample:

$ cat /etc/elasticsearch/synonym.txt
あ,かき,さしす,たちつて,なにぬねの
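
The file can also be written from the shell. The following is a minimal sketch, assuming each line holds one comma-separated group of equivalent terms, as the sample suggests. ES_CONF is treated here as an environment variable that falls back to a scratch directory; that fallback is an assumption of this sketch, not something the plugin requires.

```shell
# Assumption: ES_CONF points at the Elasticsearch config directory.
# It falls back to a scratch directory so the sketch can run anywhere.
ES_CONF="${ES_CONF:-$(mktemp -d)}"

# Write the sample dictionary: one comma-separated group of terms.
cat > "$ES_CONF/synonym.txt" <<'EOF'
あ,かき,さしす,たちつて,なにぬねの
EOF

# Sanity check: count the comma-separated terms in the file.
TERM_COUNT=$(awk -F',' '{ n += NF } END { print n + 0 }' "$ES_CONF/synonym.txt")
echo "terms: $TERM_COUNT"
```

On a real node, set ES_CONF to the actual config directory (for example /etc/elasticsearch) before running it.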

Create Index

NGramSynonymTokenizer is registered as the "ngram_synonym" tokenizer type. The following request creates an index that uses it:

$ curl -XPUT "localhost:9200/sample?pretty" -d '
{
  "settings":{
    "index":{
      "analysis":{
        "tokenizer":{
          "2gram_synonym":{
            "type":"ngram_synonym",
            "n":"2",
            "synonyms_path":"synonym.txt"
          }
        },
        "analyzer":{
          "2gram_synonym_analyzer":{
            "type":"custom",
            "tokenizer":"2gram_synonym"
          }
        }
      }
    }
  },
  "mappings":{
    "item":{
      "properties":{
        "id":{
          "type":"string",
          "index":"not_analyzed"
        },
        "msg":{
          "type":"string",
          "analyzer":"2gram_synonym_analyzer"
        }
      }
    }
  }
}'
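
To see which tokens the analyzer emits, the standard Elasticsearch _analyze API can be called on the new index. This is a sketch and not part of the plugin: ES_URL is a hypothetical variable for the node address, and the tokens returned depend on the contents of your synonym.txt.

```shell
# Assumption: ES_URL is a hypothetical variable holding the node address.
ES_URL="${ES_URL:-localhost:9200}"

# Request body for the _analyze API: run the custom analyzer on the sample text.
ANALYZE_BODY='{"analyzer":"2gram_synonym_analyzer","text":"あいうえお"}'

# Each token in the response carries its text, offsets, and position.
curl -s --max-time 5 -XPOST "$ES_URL/sample/_analyze?pretty" -d "$ANALYZE_BODY" \
  || echo "request failed (is Elasticsearch running at $ES_URL?)" >&2
```

Comparing the output for a text that contains a dictionary term with one that does not is a quick way to confirm that synonym.txt was loaded.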

Then index a document:

$ curl -XPOST localhost:9200/sample/item/1 -d '
{
  "id":"1",
  "msg":"あいうえお"
}'

Check Search Results

Run the following match_phrase queries against the msg field and compare the results:

$ curl -XPOST "http://localhost:9200/sample/_search" -d '
{
   "query": {
      "match_phrase": {
         "msg": "あ"
      }
   }
}'

$ curl -XPOST "http://localhost:9200/sample/_search" -d '
{
   "query": {
      "match_phrase": {
         "msg": "あい"
      }
   }
}'

$ curl -XPOST "http://localhost:9200/sample/_search" -d '
{
   "query": {
      "match_phrase": {
         "msg": "かき"
      }
   }
}'

$ curl -XPOST "http://localhost:9200/sample/_search" -d '
{
   "query": {
      "match_phrase": {
         "msg": "かきい"
      }
   }
}'
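
The four queries above differ only in the phrase, so they can be run in a loop. A minimal sketch: ES_URL and the phrase_query helper are hypothetical names introduced here, and the _refresh call is added on the assumption that the document was indexed moments ago and may not be searchable yet.

```shell
# Assumption: ES_URL is a hypothetical variable holding the node address.
ES_URL="${ES_URL:-localhost:9200}"

# Hypothetical helper: build a match_phrase query body for the msg field.
# It does no JSON escaping, so it only suits simple phrases like the samples.
phrase_query() {
  printf '{"query":{"match_phrase":{"msg":"%s"}}}' "$1"
}

# Make recently indexed documents visible to search.
curl -s --max-time 5 -XPOST "$ES_URL/sample/_refresh" > /dev/null \
  || echo "refresh failed (is Elasticsearch running at $ES_URL?)" >&2

# Run the same query once per phrase.
for phrase in "あ" "あい" "かき" "かきい"; do
  echo "== $phrase"
  curl -s --max-time 5 -XPOST "$ES_URL/sample/_search?pretty" \
    -d "$(phrase_query "$phrase")" \
    || echo "search failed for: $phrase" >&2
done
```

The hits.total value in each response shows whether the phrase matched the indexed document.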

Reload synonyms_path File Dynamically

If the "dynamic_reload" property is set to true, NGramSynonymTokenizer reloads the synonyms_path file on the fly (more precisely, the reload happens when the reset() method is called). To change the interval at which the file timestamp is checked, add "reload_interval".

$ curl -XPUT "localhost:9200/sample?pretty" -d '
{
  "settings":{
    "index":{
      "analysis":{
        "tokenizer":{
          "2gram_synonym":{
            "type":"ngram_synonym",
            "n":"2",
            "synonyms_path":"synonym.txt",
            "dynamic_reload":true,
            "reload_interval":"10s"
          }
        },
...
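
With dynamic_reload enabled, a dictionary change is a plain file edit. The sketch below appends one synonym group so that the file timestamp changes. The appended terms are sample data that do not come from this README, and ES_CONF falling back to a scratch directory is an assumption made so the snippet can run anywhere.

```shell
# Assumption: ES_CONF points at the Elasticsearch config directory.
ES_CONF="${ES_CONF:-$(mktemp -d)}"
SYNONYM_FILE="$ES_CONF/synonym.txt"

# Start from the sample dictionary if the file does not exist yet.
[ -f "$SYNONYM_FILE" ] || echo "あ,かき,さしす,たちつて,なにぬねの" > "$SYNONYM_FILE"

# Append a new synonym group (sample terms); this also updates the timestamp.
echo "はひ,ふへほ" >> "$SYNONYM_FILE"

# Report how many groups (lines) the dictionary now holds.
GROUP_COUNT=$(awk 'END { print NR }' "$SYNONYM_FILE")
echo "synonym groups: $GROUP_COUNT"
```

Going by the description above, the tokenizer should pick up the new group on a later reset() call, presumably once the timestamp check at the configured reload_interval has noticed the change.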