
I'm new to Elastic. I'm attempting to do a proof-of-concept for professional reasons. So far I'm very impressed. I've indexed a bunch of data and have run a few queries - almost all of which are super fast (thumbs up).

The only issue I'm encountering is that my date range query seems relatively slow compared to all my other queries. We're talking 1000ms+ compared to <100ms for everything else.

I am using the NEST .NET library.

My document structure looks like this:

{ 
   "tourId":"ABC123",
   "tourName":"Super cool tour",
   "duration":12,
   "countryCode":"MM",
   "regionCode":"AS",
   ...
   "availability":[ 
      { 
         "startDate":"2021-02-01T00:00:00",
         ...
      },
      { 
         "startDate":"2021-01-11T00:00:00",
         ...
      }
   ]
}

I'm trying to get all tours which have availability within a certain month. I am using a date range to do this. I'm not sure if there's a more efficient way to do this? Please let me know if so.

I have tried the following query:

var response = await elastic.SearchAsync<Tour>(s => s
    .Query(q => q
        .Nested(n => n
            .Path(p => p.Availability)
            .Query(nq => nq
                .DateRange(r => r
                    .Field(f => f.Availability.First().StartDate)
                    .GreaterThanOrEquals(new DateTime(2020, 07, 01))
                    .LessThan(new DateTime(2020, 08, 01))
                )
            )
        )
    )
    .Size(20)
    .Source(src => src.IncludeAll().Excludes(e => e.Fields(f => f.Availability)))
);

I basically followed the example in their documentation here: https://www.elastic.co/guide/en/elasticsearch/client/net-api/current/writing-queries.html#structured-search but I'm not sure that this is the best way for me to achieve this. Is a date range naturally slower than other queries, or am I just doing it wrong?

EDIT:

I tried adding a new field named YearMonth, which is just an integer representing the year and month of each availability in the format yyyyMM, and querying against this. The timing was also around one second. This makes me wonder whether it's not actually an issue with the date but something else entirely.
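
For reference, a yyyyMM integer of this kind can be derived from each start date roughly as follows. This is only a sketch: `YearMonthKey` and `FromDate` are hypothetical names chosen for illustration and are not part of NEST or of my actual model.

```csharp
using System;

public static class YearMonthKey
{
    // Encodes a date as an integer in yyyyMM form by shifting the year
    // two decimal places and adding the month number (1-12).
    public static int FromDate(DateTime date) => date.Year * 100 + date.Month;
}
```

With this, `YearMonthKey.FromDate(new DateTime(2020, 7, 1))` evaluates to 202007, which is the value used in the term query below.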

I have run a profiler on my query and the result is below. I have no idea what most of it means so if someone does and can give me some help that'd be great:

Query:

var response = await elastic.SearchAsync<Tour>(s => s
    .Query(q => q
        .Nested(n => n
            .Path(p => p.Availability)
            .Query(nq => nq
                .Term(t => t
                    .Field(f => f.Availability.First().YearMonth)
                    .Value(202007)
                )
            )
        )
    )
    .Size(20)
    .Source(src => src.IncludeAll().Excludes(e => e.Fields(f => f.Availability)))
    .Profile()
);

Profile:

{ 
   "Shards":[ 
      { 
         "Aggregations":[ 

         ],
         "Id":"[pr4Os3Y7RT-gXRWR0gxoEQ][tours][0]",
         "Searches":[ 
            { 
               "Collector":[ 
                  { 
                     "Children":[ 
                        { 
                           "Children":[ 

                           ],
                           "Name":"SimpleTopDocsCollectorContext",
                           "Reason":"search_top_hits",
                           "TimeInNanoseconds":6589867
                        }
                     ],
                     "Name":"CancellableCollector",
                     "Reason":"search_cancelled",
                     "TimeInNanoseconds":13981165
                  }
               ],
               "Query":[ 
                  { 
                     "Breakdown":{ 
                        "Advance":5568,
                        "BuildScorer":2204354,
                        "CreateWeight":25661,
                        "Match":0,
                        "NextDoc":3650375,
                        "Score":3795517
                     },
                     "Children":null,
                     "Description":"ToParentBlockJoinQuery (availability.yearMonth:[202007 TO 202007])",
                     "TimeInNanoseconds":9686512,
                     "Type":"ESToParentBlockJoinQuery"
                  }
               ],
               "RewriteTime":36118
            }
         ]
      }
   ]
}
Pieterjan
Andy Furniss

1 Answer


This seems like a data structure optimisation issue. Without changing too much, you could convert all your availability dates into Unix timestamps and then use a range query (quick conversion tips in C# can be found here).
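
As a rough illustration of the conversion, the sketch below uses the standard `DateTimeOffset.ToUnixTimeSeconds` method. The `UnixTime` helper is a hypothetical name, and treating the stored dates as UTC is an assumption, since the sample documents carry no offset.

```csharp
using System;

public static class UnixTime
{
    // Converts a DateTime to whole seconds since 1970-01-01T00:00:00Z.
    // Assumption: the input is interpreted as UTC regardless of its Kind.
    public static long ToUnixSeconds(DateTime value)
    {
        var utc = DateTime.SpecifyKind(value, DateTimeKind.Utc);
        return new DateTimeOffset(utc).ToUnixTimeSeconds();
    }
}
```

The month boundaries from the question would then become `UnixTime.ToUnixSeconds(new DateTime(2020, 7, 1))` and `UnixTime.ToUnixSeconds(new DateTime(2020, 8, 1))`, used as the lower (inclusive) and upper (exclusive) bounds of a numeric range.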

Another option is to create monthly (or weekly or yearly, depending on your data) indices and, before executing your query, narrow it to only the indices you need. This would mean putting the same listings into multiple indices (duplicating documents across indices) depending on the availability month/day.

Separating timestamp (time-series) data per certain index granularity is a common practice in ES. More info here.

The latter would mean that you would filter on a DateTime field rather than an array of timestamps.
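
To make the per-month index idea concrete, here is a minimal sketch that works out which indices a date range touches. The `<prefix>-yyyy.MM` naming scheme and the `MonthlyIndices` helper are assumptions made for illustration, not something Elasticsearch or NEST prescribes.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class MonthlyIndices
{
    // Returns one index name per calendar month overlapping [from, toExclusive).
    // Hypothetical naming scheme: "<prefix>-yyyy.MM".
    public static IReadOnlyList<string> NamesFor(string prefix, DateTime from, DateTime toExclusive)
    {
        var names = new List<string>();
        // Start at the first day of the month containing `from`.
        var cursor = new DateTime(from.Year, from.Month, 1);
        while (cursor < toExclusive)
        {
            // Invariant culture keeps the year and month in Gregorian form.
            names.Add(prefix + "-" + cursor.ToString("yyyy.MM", CultureInfo.InvariantCulture));
            cursor = cursor.AddMonths(1);
        }
        return names;
    }
}
```

For the July 2020 query in the question, `MonthlyIndices.NamesFor("tours", new DateTime(2020, 7, 1), new DateTime(2020, 8, 1))` yields the single name `tours-2020.07`. The resulting names would then be passed as the search target (in NEST, typically via `.Index(...)` on the search descriptor, though check that against the client version you use).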

I'd personally go with the second option.

Neil Varnas
  • "You're doing a .First() which I think means that only the first element from your dates array will be filtered." - this is not true; the _expression_ passed to `.Field()` is used to build a string path to the field on the model by visiting the expression. – Russ Cam Oct 13 '19 at 23:09
  • You're absolutely right, overlooked that the array has/would have other named objects. Will remove unnecessary comment. – Neil Varnas Oct 14 '19 at 07:38
  • Thanks for your reply Neil. Your suggestions sound good and I will look into them for extra efficiency. However, after some further fiddling, I've discovered that I was actually encountering the standard delay on the first query when using NEST (https://stackoverflow.com/q/44725584/5392786). I changed the order of my queries around and discovered that the first one always takes around a second, regardless of how simple the query actually is. The month query I was trying was <100ms if it wasn't the first to run! – Andy Furniss Oct 14 '19 at 10:42
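
The first-call delay described in the last comment can be checked by timing the same call several times in a row. The harness below is a minimal sketch using only `System.Diagnostics.Stopwatch`; `WarmUpTiming` and `MeasureAsync` are hypothetical names, and the operation being timed is whatever delegate the caller supplies.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class WarmUpTiming
{
    // Runs the same async operation `runs` times, one after another,
    // and returns the elapsed wall-clock time of each run in milliseconds.
    public static async Task<long[]> MeasureAsync(Func<Task> operation, int runs)
    {
        var timings = new long[runs];
        for (var i = 0; i < runs; i++)
        {
            var stopwatch = Stopwatch.StartNew();
            await operation();
            stopwatch.Stop();
            timings[i] = stopwatch.ElapsedMilliseconds;
        }
        return timings;
    }
}
```

Passing the search call from the question as the delegate and comparing the first entry of the result with the rest should show whether the one-second cost belongs to the query itself or only to the first request the client makes.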