Elasticsearch URL Tokenizer and URL Token Filter

This plugin enables URL tokenization and token filtering by URL part.

Compatibility

Elasticsearch Version    Plugin Version
5.6.3                    5.6.3.0
5.6.1                    5.6.1.0
5.5.1                    5.5.1.0
5.5.0                    5.5.0.0
5.2.2                    5.2.2.0
5.2.1                    5.2.1.1
5.1.1                    5.1.1.0
5.0.0                    5.0.0.1
2.4.3                    2.4.3.0
2.4.1                    2.4.1.0
2.4.0                    2.4.0.0
2.3.5                    2.3.5.0
2.3.4                    2.3.4.3
2.3.3                    2.3.3.5
2.3.2                    2.3.2.1
2.3.1                    2.3.1.1
2.3.0                    2.3.0.1
2.2.2                    2.2.3
2.2.1                    2.2.2.1
2.2.0                    2.2.1
2.1.1                    2.2.0
2.1.1                    2.1.1
2.0.0                    2.1.0
1.6.x, 1.7.x             2.0.0
1.6.0                    1.2.1
1.5.2                    1.1.0
1.4.2                    1.0.0

Installation

Elasticsearch v5

bin/elasticsearch-plugin install https://github.com/jlinn/elasticsearch-analysis-url/releases/download/v5.6.3.0/elasticsearch-analysis-url-5.6.3.0.zip

Elasticsearch v2

bin/plugin install https://github.com/jlinn/elasticsearch-analysis-url/releases/download/v2.4.3.0/elasticsearch-analysis-url-2.4.3.0.zip
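
For other releases, the install command can be assembled from the compatibility table. The sketch below is only an illustration: the URL pattern is inferred from the two commands above and is assumed, not verified, to hold for every release, and `install_command` is a hypothetical helper name that covers Elasticsearch v2 and v5 only.

```python
# Assumption: the release URL pattern is inferred from the v5.6.3.0 and
# v2.4.3.0 install commands; it has not been checked for other releases.
RELEASE_URL = (
    "https://github.com/jlinn/elasticsearch-analysis-url/releases/download/"
    "v{version}/elasticsearch-analysis-url-{version}.zip"
)


def install_command(es_version: str, plugin_version: str) -> str:
    """Build the install command for a given Elasticsearch/plugin version pair."""
    major = int(es_version.split(".")[0])
    if major == 5:
        tool = "bin/elasticsearch-plugin"  # command used for Elasticsearch v5
    elif major == 2:
        tool = "bin/plugin"  # command used for Elasticsearch v2
    else:
        # Other major versions are not covered by the commands in this README.
        raise ValueError(f"unsupported Elasticsearch major version: {major}")
    return f"{tool} install {RELEASE_URL.format(version=plugin_version)}"


if __name__ == "__main__":
    print(install_command("5.6.3", "5.6.3.0"))
    print(install_command("2.4.3", "2.4.3.0"))
```

Pick the plugin version from the row of the compatibility table that matches your Elasticsearch version.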

Usage

URL Tokenizer

Options:

Example:

Index settings:

{
    "settings": {
        "analysis": {
            "tokenizer": {
                "url_host": {
                    "type": "url",
                    "part": "host"
                }
            },
            "analyzer": {
                "url_host": {
                    "tokenizer": "url_host"
                }
            }
        }
    }
}
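
One way to apply these settings is to send them as the body of a create-index request. The following is a minimal sketch using only the Python standard library; the node address `http://localhost:9200` and the index name `index_name` are taken from the analysis request in this README, and `build_create_index_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Same analysis settings as the JSON document above.
SETTINGS = {
    "settings": {
        "analysis": {
            "tokenizer": {"url_host": {"type": "url", "part": "host"}},
            "analyzer": {"url_host": {"tokenizer": "url_host"}},
        }
    }
}


def build_create_index_request(base_url: str, index: str, body: dict) -> urllib.request.Request:
    """Prepare (but do not send) a PUT request that creates an index with settings."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    request = build_create_index_request("http://localhost:9200", "index_name", SETTINGS)
    print(request.get_method(), request.full_url)
    # To actually create the index on a running node with the plugin installed:
    # urllib.request.urlopen(request)
```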

Make an analysis request:

curl 'http://localhost:9200/index_name/_analyze?analyzer=url_host&pretty' -d 'https://foo.bar.com/baz.html'

{
  "tokens" : [ {
    "token" : "foo.bar.com",
    "start_offset" : 8,
    "end_offset" : 19,
    "type" : "host",
    "position" : 1
  }, {
    "token" : "bar.com",
    "start_offset" : 12,
    "end_offset" : 19,
    "type" : "host",
    "position" : 2
  }, {
    "token" : "com",
    "start_offset" : 16,
    "end_offset" : 19,
    "type" : "host",
    "position" : 3
  } ]
}
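
The response above contains the full host plus each of its parent domains, with offsets pointing back into the original URL. The sketch below reproduces that output in plain Python to show how the tokens and offsets relate. It is an illustration only, not the plugin's implementation, and `host_tokens` is a hypothetical name.

```python
from urllib.parse import urlsplit


def host_tokens(url: str) -> list:
    """Return (token, start_offset, end_offset) for the host and each parent domain."""
    host = urlsplit(url).hostname
    if not host:
        return []
    # hostname is lower-cased by urlsplit, so search a lower-cased copy of the URL.
    host_start = url.lower().find(host)
    host_end = host_start + len(host)
    labels = host.split(".")
    tokens = []
    for i in range(len(labels)):
        token = ".".join(labels[i:])
        # Every token ends where the host ends; only the start offset moves right.
        tokens.append((token, host_end - len(token), host_end))
    return tokens


if __name__ == "__main__":
    for token in host_tokens("https://foo.bar.com/baz.html"):
        print(token)
```

For `https://foo.bar.com/baz.html` this yields `foo.bar.com` (8 to 19), `bar.com` (12 to 19) and `com` (16 to 19), matching the offsets in the response.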

URL Token Filter

Options:

Example:

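Set up your index like so (the mapping uses the legacy "string" field type; Elasticsearch 5.x replaces it with "text" and "keyword"):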

{
    "settings": {
        "analysis": {
            "filter": {
                "url_host": {
                    "type": "url",
                    "part": "host",
                    "url_decode": true,
                    "tokenize_host": false
                }
            },
            "analyzer": {
                "url_host": {
                    "filter": ["url_host"],
                    "tokenizer": "whitespace"
                }
            }
        }
    },
    "mappings": {
        "example_type": {
            "properties": {
                "url": {
                    "type": "multi_field",
                    "fields": {
                        "url": {"type": "string"},
                        "host": {"type": "string", "analyzer": "url_host"}
                    }
                }
            }
        }
    }
}

Make an analysis request:

curl 'http://localhost:9200/index_name/_analyze?analyzer=url_host&pretty' -d 'https://foo.bar.com/baz.html'

{
  "tokens" : [ {
    "token" : "foo.bar.com",
    "start_offset" : 0,
    "end_offset" : 32,
    "type" : "word",
    "position" : 1
  } ]
}
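
With "tokenize_host" set to false, the filter emits the host as a single token rather than splitting it into parent domains. The sketch below imitates that analyzer chain in plain Python: a whitespace split followed by host extraction. It is an illustration under assumptions, not the plugin's code: `url_host_filter` is a hypothetical name, "url_decode" is assumed to mean percent-decoding, and dropping tokens without a host is a choice made here because the README does not say how the plugin treats them.

```python
from urllib.parse import unquote, urlsplit


def url_host_filter(text: str, url_decode: bool = True) -> list:
    """Split text on whitespace and keep the host of each token that has one."""
    hosts = []
    for word in text.split():
        try:
            host = urlsplit(word).hostname
        except ValueError:
            # urlsplit rejects some malformed inputs (e.g. a bad IPv6 literal).
            host = None
        if host is None:
            continue  # assumption: tokens without a host are skipped
        # Assumption: url_decode corresponds to percent-decoding the token.
        hosts.append(unquote(host) if url_decode else host)
    return hosts


if __name__ == "__main__":
    print(url_host_filter("https://foo.bar.com/baz.html"))
```

Unlike the tokenizer example, the host is not broken into `bar.com` and `com` here, which mirrors the single-token response above.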