Query performance with not existing relationship checks in Neo4j

Hi,

we are trying to optimize a query but the time explodes (~20 seconds) when having around 40K nodes in the database, but it should be way faster.

First, I will describe a simplified description of our schema. We have the following nodes:

  • Usergroup
  • Feature
  • Asset
  • Section

We also have the following relationships:

  • A Feature has only one Section (IS_IN_SECTION)
  • A Feature has one or more Asset (CONTAINS_ASSET)
  • An asset may be restricted for a Usergroup (HAS_RESTRICTED_ASSET)
  • A Feature may be restricted for a Usergroup (HAS_RESTRICTED_FEATURE)
  • A Section, and therefore, all the Feature of that Section, may be restricted for a Usergroup (HAS_RESTRICTED_SECTION)
  • A Usergroup may have a parent Usergroup (HAS_PARENT_GROUP) and it should fulfill its restrictions and those of its parents

The goal is, given a Usergroup, to list the top 20 assets ordered by date, that don't have any restrictions with the Usergroup.

The current query is similar to:

(1)
MATCH path=(:UserGroup {uid: $usergroup_uid})-[:HAS_PARENT_GROUP*0..]->(root:UserGroup)
  WHERE NOT (root)-[:HAS_PARENT_GROUP]->(:UserGroup)
  WITH nodes(path) AS usergroups
  UNWIND usergroups AS ug

(2)
MATCH (node:Asset)
  WHERE AND NOT (node)<-[:CONTAINS_ASSET]-(:Feature)-[:IS_IN_SECTION]->(:Section)<-[:HAS_RESTRICTED_SECTION {restriction_type: "view"}]-(ug) 
  AND NOT (node)<-[:HAS_RESTRICTED_ASSET {restriction_type: "view"}]-(ug)
  AND NOT (node)<-[:CONTAINS_ASSET]-(:Feature)<-[:HAS_RESTRICTED_FEATURE {restriction_type: "view"}]-(ug)

RETURN DISTINCT node 
ORDER BY node.date DESC
SKIP 0
LIMIT 20

We have a few more types of restrictions but here we have the main idea.

Some observations we have made are:

  • If we execute the query part (1) adding return ug after unwind, this query is solved in 1ms
  • If we change the query part (1) to MATCH (ug:Usergroup {uid: $usergroup_uid}) ignoring the parent groups, the query is solved in around 800ms. And if we add back the original part (1) it is solved in 8 seconds even if the Usergroup has no parents.

Currently, our database is small compared to the expected number of nodes (~6 millions), and the number of restrictions will grow, and we need to optimize this kind of queries.

For that, we have these questions:

  • The NOT <restrictions> (ex: NOT (node)<-[:HAS_RESTRICTED_ASSET {restriction_type: "view"}]-(ug)) conditions is correct in this kind of situation or are there other approachs to get the job done more efficiently?
  • Do we need any type of index?
  • Is the structure of the schema right, or are there any inefficiencies?
  • How can we rewrite the part (1) of the query or what do you thinks is causing the overhead with it?

The database version is Neo4j 3.5.X
Captures of the query profile are:

Thanks in advance

Hello @bennu_neo , I'm a workmate of @ainsausti , and I have succesfully implemented your suggested query, but I would like to introduce the {restriction_type: "view"} condition. Could you help us with this?

Thank you so much.

Hi @Isidoro !

Welcome to the community!

About this query condition. We have 2 solutions/options.

1. Remodeling. How many restriction types do you have? does have sense on your model having different relationships for every different relationship type? In that case, your can use that especific relationship on your APOC filter.

2. Creating your own traverser. You can 'easily' extend apoc in order to add this kind of conditions in a custom traverser. Do you have any experience forking public repositories?

Hahaha, sorry, that was not my intention, I misread the post and interpreted that you were proposing that query :pray:

Hi again @bennu_neo.

As I understand, with the first option you propose to convert the restriction property in a relationship. For example, we have download restrictions and view restrictions, so instead of having HAS_RESTRICTED_ASSET with restriction_type:"view" or restriction_type: "download", we could have HAS_RESTRICTED_VIEW_ASSET and HAS_RESTRICTED_DOWNLOAD_ASSET. Is this correct? This option is feasible, but we may need to migrate the existing structure in our production environment.

The second option is not valid for us because we do not want to maintain a fork customized branch.

Thank you very much for your response!

Hi @ainsausti !

Exactly. The first option may force you to remodel (something perfectly normal) in order to optimize other traversals but it will keep the performance of the query I showed you. I understand you position about the second option and it's completely valid. You can always create your own procedure that executes the traversal the way you want if changing the model became eventually not possible.

Soon this kind of queries will become much easier :wink:

Hi again,

optimization issues are coming back :sweat_smile:

We have now around 600K assets in the production database and the final load that the database should stand is around 4 million assets.

Our query now looks like this:

MATCH (u:UserGroup {uid: "fake_usergroup_uid"})
CALL apoc.path.subgraphNodes(u, {

	relationshipFilter: "HAS_PARENT_GROUP>|HAS_RESTRICTED_ASSET_VIEW>|CONTAINS_ASSET>|HAS_RESTRICTED_FEATURE_VIEW>|HAS_RESTRICTED_SECTION_VIEW|<IS_IN_SECTION|HAS_RESTRICTED_PROVIDER_VIEW|<HAS_PROVIDER|HAS_RESTRICTED_CATEGORY_VIEW|<HAS_CATEGORY",

	labelFilter: "/Asset"

	})

	yield node
	with collect(id(node)) as bl
	MATCH (node:Asset)
	WHERE not id(node) in bl
	WITH node
	WHERE node.logically_deleted = false 
	RETURN DISTINCT node.date ORDER BY node.date DESC 
        SKIP 0 LIMIT 50

This query is completed in around 3~4 seconds. But this is the base query, more filters could be added dynamically making the query last more than 10 seconds.

After a bit of research, we've realized that by removing the "order by" statement from the query, the completion time is reduced to 30 ms, so we've been focused on reducing the time of the "order by" with indexes. The experiments we've tried are:

  • Simply adding an index to Asset node.date -> Without significant improvement.
  • With the index, adding an artificial where clause to activate the index WHERE node.date is not NULL as in the example showed in the docs (https://neo4j.com/docs/cypher-manual/current/query-tuning/advanced-example/))-> Without significant improvements
  • We thought that maybe our neo4j version was outdated and didn't applied the index correctly (v3.5.14), we've migrated to v4.4.11 and tried -> Without significant improvements.

Is there something that we are not understanding about how indexes work in order by clauses?

Another piece of information that may be impacting the performance is that node.date is not a native Neo4j date but a string with the format "YYYY-MM-DD". Should transforming the date to a native data structure improve the performance?

Thanks in advance

Hey @ainsausti !

Good to see you around here again. Let's build a case...

First thing, Why does the relationship HAS_RESTRICTED_SECTION_VIEW has no relationship attached on your subgraph expansion?

What's the order of magnitude (cardinality) of assets on that subgraph expansion?

What's the count of Assets in general? What's the percentage of them with the property logically_deleted being true/false?

See u around

Hi again @bennu_neo! Thanks for your fast response!

As for the first question, sorry, but I don't really understand what you mean with "has no relationship attached". Do you mean why it doesn't have a direction specified?

The asset count we currently have is 537446.

There are 8700 assets restricted for the usergroup we are testing that match with the subgraph expansion. On average, the cardinality of these nodes is ~13 relationships.

The percentage of assets with the property logically deleted being true is very low -> 0.003% (17 assets)

I've also attached the profile in png.

My bad @ainsausti , I was trying to mean "direction".

How long does the first query takes? (just the apoc).

So if I understand properly, you look for some Assets in your first query but then you want EVERY Asset that's not softly deleted that it's not part of your previous list (while also ordering)?

In other thoughts @ainsausti ,

Do you have an ids on Assets with an index? Can you create a composite index with the delte property and try something like:

MATCH (u:UserGroup {uid: "fake_usergroup_uid"})
CALL apoc.path.subgraphNodes(u, {

	relationshipFilter: "HAS_PARENT_GROUP>|HAS_RESTRICTED_ASSET_VIEW>|CONTAINS_ASSET>|HAS_RESTRICTED_FEATURE_VIEW>|HAS_RESTRICTED_SECTION_VIEW|<IS_IN_SECTION|HAS_RESTRICTED_PROVIDER_VIEW|<HAS_PROVIDER|HAS_RESTRICTED_CATEGORY_VIEW|<HAS_CATEGORY",

	labelFilter: "/Asset"
	})
	yield node
	with collect(node.id) as bl
	MATCH (node:Asset)
	WHERE not node.id in bl
	AND node.logically_deleted = false 
	RETURN DISTINCT node.date ORDER BY node.date DESC 
        SKIP 0 LIMIT 50

And share the profile. It may not solve the issue but we can get some ideas :slightly_smiling_face:

Hi again!

There is no reason why the direction is not set in some of the relationship filters, I've added them.

The apoc query takes 32 ms. So, I think, it's not the problematic part of the query.

You understood correctly the intention of the query. Nonetheless, this is the base query, some filters are added dynamically to this one. For example, to find the assets with a description equal to "Test description", the condition "AND node.description = 'Test description' " will be added.

We have a "uid" property with a unique index. I've added a composite index on Asset.uid and Asset.logically_deleted and chnged the query. But still there is no improvement. I think the index are not applied in the execution.

The profile with the index is:

plan (1).png

I think the index was created succesfully:

ainsausti_0-1663073199498.png