Optimizing Relationship query with Where filters

risingodegua · November 10, 2021, 8:25am

Hi folks

I have database with about 7m nodes and relationships, and I'm currently executing the following query:

CALL db.index.fulltext.queryNodes('organization', 'food' )
YIELD node
WITH node as org LIMIT 5000
MATCH (org)<-[:LOCATED_IN]-(loc:Location)
WHERE loc.zipCode="75219" OR loc.zipCode="75001" OR loc.zipCode="75002" OR loc.zipCode="75006" OR loc.zipCode="75007" OR loc.zipCode="75009" OR loc.zipCode="75010" OR loc.zipCode="75011" OR loc.zipCode="75013" OR loc.zipCode="75014" OR loc.zipCode="75015" OR loc.zipCode="75016" OR 
RETURN DISTINCT org, loc
SKIP 0 LIMIT 5

The query above simply does a full-text search with the word food and retrieves all related location nodes where the zipcode is any of the above.

This works fine, but it's relatively slow, and I would appreciate any help to make it faster.
Using the PROFILE tool, I can see that the bottleneck is in two places, the MATCH and WHERE clauses.

So my question is, is there an optimal way of doing this to make it faster?. The current speed is approximately 30-40 seconds.

Bennu · November 10, 2021, 6:11pm

Hi @risingodegua!

Let's try two things.

First

CREATE INDEX ZIP_LOCATION FOR (n:Location) ON (n.zipCode);

It will take some time before it arrives to 100% completation, so check with call db.indexes in order to be sure that is populated before testing second part.

Then...

CALL db.index.fulltext.queryNodes('organization', 'food' )
YIELD node
WITH node as org LIMIT 5000
MATCH (org)<-[:LOCATED_IN]-(loc:Location)
WHERE loc.zipCode in ["75219", ...]
RETURN DISTINCT org, loc
SKIP 0 LIMIT 5

Lemme know how it goes.

Bennu

andrew_bowman · November 10, 2021, 7:46pm

Do you think this query is asking the right question? I'm wondering if the approach would work better if you flipped the ordering.

Right now, you find any 5000 food organizations, then of those 5000, you check if any are located in the zip codes given and try to limit it to 5. Your PROFILE plan shows that none of those 5000 food organizations are in those zip codes, so nothing gets returned.

I would think it would be more useful to find food organizations in those zip codes, where you either start from an index lookup on the zipCode, or a composite index lookup by zipCode and "food" organization value, though that would require some duplication of properties I think on the node that is looked up. In any case, I think that would lead to more relevant results, instead of relying on the first random 5000 nodes of the fulltext index to be present in the zip codes filtered.

risingodegua · November 11, 2021, 6:53am

Thanks for the suggestion @andrew_bowman

I considered this approach as well, and if possible I'm sure it's definitely better. The only problem I have is how to do a full text search on the returned results for location.

Something like:

MATCH (org)<-[:LOCATED_IN]-(loc:Location)
WHERE loc.zipCode="75219" OR loc.zipCode="75001" OR loc.zipCode="75002" OR loc.zipCode="75006" OR loc.zipCode="75007" OR loc.zipCode="75009" OR loc.zipCode="75010" OR loc.zipCode="75011" OR loc.zipCode="75013" OR loc.zipCode="75014" OR loc.zipCode="75015" OR loc.zipCode="75016"
WITH org, loc
CALL db.index.fulltext.queryNodes("organization", 'food' )
YIELD node
RETURN DISTINCT node, loc
SKIP 0 LIMIT 5

Is it possible to do a full text search on filtered nodes?

Update:

I tried the following and it works way faster:

MATCH (org)<-[:LOCATED_IN]-(loc:Location)
WHERE 
 (org.description CONTAINS 'food' OR org.name CONTAINS 'food' OR org.classification CONTAINS 'food')
AND 
 (loc.zipCode IN ["75219","75001","75002","75006","75007","75009","75010" ,"75011","75013","75014","75015","75016"])
RETURN DISTINCT org, loc
SKIP 0 LIMIT 5

Although this isn't as powerful as FTS, but it improves speed

risingodegua · November 11, 2021, 6:53am

Thanks for the suggestion @Bennu

Sadly, I get the same performance after creating the index.

Bennu · November 11, 2021, 7:44am

Hi!

Can you share a profile if this newer version of the query?

Bennu

risingodegua · November 11, 2021, 7:49am

Sure, here it is:

risingodegua · November 11, 2021, 7:53am

I tried this, but regex doesn't seem to work.

 MATCH (org:Organization)<-[:LOCATED_IN]-(loc:Location)
 WHERE (
       org.name=~ '.*food.*' OR org.description=~ '.*food.*' OR org.classification=~ '.*food.*'
      )
 AND (
        loc.zipCode IN ["75219","75001","75002","75006","75007","75009","75010","75011","75013","75014","75015","75016","75017","75019","75022","75023","75024","75025","75026","75027"]
     )
RETURN DISTINCT org, loc
SKIP 0 LIMIT 5

This would have been better if it worked.

Update!!

This works and is way faster. I added a case-insensitive regex check. The final solution is:

PROFILE
MATCH (org:Organization)<-[:LOCATED_IN]-(loc:Location)
 WHERE (
       org.name=~ '(?i).*Food.*' OR org.description=~ '(?i).*Food.*' OR org.classification=~ '(?i).*Food.*'
      )
 AND (
        loc.zipCode IN ["75219","75001","75002","75006","75007","75009","75010","75011","75013","75014","75015","75016","75017","75019","75022","75023","75024","75025","75026","75027"]
     )
RETURN DISTINCT org, loc
SKIP 0 LIMIT 5

Topic		Replies	Views
How to Optimize a Query with Huge Where Clause Cypher performance , cypher	7	780	June 16, 2020
Help with optimizing query Cypher performance , browser , cypher , operations	3	416	September 21, 2021
Optimising query performance with a relatively simple match Cypher performance	3	637	July 3, 2020
Slow cypher query computation Cypher	4	1197	April 18, 2019
Querying relationships slow performance Cypher performance , cypher , relationship	4	1992	October 15, 2020

Optimizing Relationship query with Where filters

Related topics