Given the following documents in an index (let's call it addresses):

{
    ADDRESS: {
        ID: 1,
        LINE1: "steet 1",
        CITY: "kuala lumpur",
        COUNTRY: "MALAYSIA",
        ...
    } 
}
{
    ADDRESS: {
        ID: 2,
        LINE1: "steet 1",
        CITY: "kualalumpur city",
        COUNTRY: "MALAYSIA",
        ...
    }
}
{
    ADDRESS: {
        ID: 3,
        LINE1: "steet 1",
        CITY: "kualalumpur",        
        COUNTRY: "MALAYSIA",
        ...
    }
}
{
    ADDRESS: {
        ID: 4,
        LINE1: "steet 1",
        CITY: "kuala lumpur city",      
        COUNTRY: "MALAYSIA",
        ...
    }
}

So far, I have a query that returns "kualalumpur", "kuala lumpur", and "kualalumpur city" for the search text "kualalumpur".
However, "kuala lumpur city" is missing from the results, even though it is nearly identical to "kualalumpur city".

Here is my query so far:

{
  "query": {
    "bool": {
      "should": [
          {"match": {"ADDRESS.STREET": {"query": "street 1", "fuzziness": 1, "operator": "AND"}}},
          {
            "bool": {
              "should": [
                {"match": {"ADDRESS.CITY": {"query": "kualalumpur", "fuzziness": 1, "operator": "OR"}}},
                {"match": {"ADDRESS.CITY.keyword": {"query": "kualalumpur", "fuzziness": 1, "operator": "OR"}}}
              ]
            }
          }
        ],
      "filter": {
        "bool": {
          "must": [
            {"term": {"ADDRESS.COUNTRY.keyword": "MALAYSIA"}}
          ]
        }
      },
      "minimum_should_match": 2
    }
  }
}

Given these conditions, is it possible at all for Elasticsearch to return all four documents for the search text "kualalumpur"?

fruqi
  • It would be great if you could let me know whether it solved your problem. – Amit Jul 20 '20 at 03:06
  • Hey! Indeed it does! Thanks. One follow-up question: what is the benefit of choosing edge n-gram over n-gram in this case? – fruqi Jul 20 '20 at 03:09
  • Glad it helped, and thanks for upvoting and accepting the answer. You need a prefix-style search (like `kualalumpur`), not an infix search (like `ualal`), which is costly. Edge n-gram creates far fewer tokens and performs better, which suits your use case. – Amit Jul 20 '20 at 03:15
  • That makes sense. Thanks a lot for the explanation :) – fruqi Jul 20 '20 at 03:19
  • No issues, my pleasure – Amit Jul 20 '20 at 03:48
  • Please see my detailed answer on the same topic, https://stackoverflow.com/a/60584211/4039431, where I also link my blog, and don't forget to upvote if you like the answer :) – Amit Jul 20 '20 at 03:49
  • Awesome! Thanks again :D – fruqi Jul 20 '20 at 06:32

1 Answer

You can use an edge n-gram tokenizer on the field to get all four docs. I tried it locally, and a working example follows. Note that the example field is named country, but it holds the city values from your question.

Create a custom analyzer and apply it to your field:

{
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "ngram_analyzer": {
                        "type": "custom",
                        "filter": [
                            "lowercase"
                        ],
                        "tokenizer": "edgeNGramTokenizer"
                    }
                },
                "tokenizer": {
                    "edgeNGramTokenizer": {
                        "token_chars": [
                            "letter",
                            "digit"
                        ],
                        "min_gram": "1",
                        "type": "edge_ngram",
                        "max_gram": "40"
                    }
                }
            },
            "max_ngram_diff": "50"
        }
    },
    "mappings": {
        "properties": {
            "country": {
                "type": "text",
                "analyzer" : "ngram_analyzer"
            }
        }
    }
}
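
To see roughly which tokens this analyzer produces, here is a small Python approximation. It is not Elasticsearch itself: the helper name edge_ngrams is made up for illustration, and it only imitates the settings above (split on anything that is not a letter or digit, emit prefixes from min_gram to max_gram, then lowercase).

```python
import re


def edge_ngrams(text, min_gram=1, max_gram=40):
    """Rough imitation of the ngram_analyzer defined above (an assumption,
    not the real Elasticsearch implementation)."""
    tokens = []
    # token_chars = letter, digit: any other character ends the current word
    for word in re.findall(r"[^\W_]+", text):
        # emit every prefix of the word between min_gram and max_gram chars
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n].lower())
    return tokens


print(edge_ngrams("kuala lumpur"))
print(edge_ngrams("kualalumpur"))
```

Both values share the prefixes k, ku, kua, kual and kuala, which is what lets a query for kualalumpur reach "kuala lumpur". To check the real tokens, you can send the text to the index's _analyze endpoint with "analyzer": "ngram_analyzer".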

Index all four sample docs, for example:

{
  "country" : "kuala lumpur"
}

A search query for the term kualalumpur then matches all four docs:

{
    "query": {
        "match" : {
            "country" : "kualalumpur"
        }
    }
}

 "hits": [
      {
        "_index": "fuzzy",
        "_type": "_doc",
        "_id": "3",
        "_score": 5.0003963,
        "_source": {
          "country": "kualalumpur"
        }
      },
      {
        "_index": "fuzzy",
        "_type": "_doc",
        "_id": "2",
        "_score": 4.4082437,
        "_source": {
          "country": "kualalumpur city"
        }
      },
      {
        "_index": "fuzzy",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.5621849,
        "_source": {
          "country": "kuala lumpur"
        }
      },
      {
        "_index": "fuzzy",
        "_type": "_doc",
        "_id": "4",
        "_score": 0.4956103,
        "_source": {
          "country": "kuala lumpur city"
        }
      }
    ]
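
Why all four docs match can be sketched in Python. This is only an approximation under stated assumptions: the edge_ngrams helper is hypothetical and imitates the analyzer settings, and it assumes the same analyzer runs at search time (the mapping sets no separate search_analyzer). Counting shared tokens is not how Elasticsearch scores; it only shows that every doc shares at least one token with the query.

```python
import re


def edge_ngrams(text, min_gram=1, max_gram=40):
    """Rough imitation of the ngram_analyzer (illustrative, not the real thing)."""
    tokens = []
    for word in re.findall(r"[^\W_]+", text):
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n].lower())
    return tokens


# The four sample values from the question, keyed by document ID
docs = {
    1: "kuala lumpur",
    2: "kualalumpur city",
    3: "kualalumpur",
    4: "kuala lumpur city",
}

# The match query is analyzed too, so the search text also becomes prefixes
query_tokens = set(edge_ngrams("kualalumpur"))

# Tokens each document has in common with the query, shortest first
shared = {
    doc_id: sorted(query_tokens & set(edge_ngrams(text)), key=len)
    for doc_id, text in docs.items()
}

for doc_id, tokens in shared.items():
    print(doc_id, len(tokens), tokens)
```

Docs 2 and 3 share all eleven prefixes of kualalumpur, while docs 1 and 4 share only the five prefixes up to kuala. That is consistent with the score gap in the hits above, although the actual scores also depend on field length and term statistics.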

 
Amit