Query performance with not existing relationship checks in Neo4j

Hi,

we are trying to optimize a query but the time explodes (~20 seconds) when having around 40K nodes in the database, but it should be way faster.

First, I will describe a simplified description of our schema. We have the following nodes:

  • Usergroup
  • Feature
  • Asset
  • Section

We also have the following relationships:

  • A Feature has only one Section (IS_IN_SECTION)
  • A Feature has one or more Asset (CONTAINS_ASSET)
  • An asset may be restricted for a Usergroup (HAS_RESTRICTED_ASSET)
  • A Feature may be restricted for a Usergroup (HAS_RESTRICTED_FEATURE)
  • A Section, and therefore, all the Feature of that Section, may be restricted for a Usergroup (HAS_RESTRICTED_SECTION)
  • A Usergroup may have a parent Usergroup (HAS_PARENT_GROUP) and it should fulfill its restrictions and those of its parents

The goal is, given a Usergroup, to list the top 20 assets ordered by date, that don't have any restrictions with the Usergroup.

The current query is similar to:

(1)
MATCH path=(:UserGroup {uid: $usergroup_uid})-[:HAS_PARENT_GROUP*0..]->(root:UserGroup)
  WHERE NOT (root)-[:HAS_PARENT_GROUP]->(:UserGroup)
  WITH nodes(path) AS usergroups
  UNWIND usergroups AS ug

(2)
MATCH (node:Asset)
  WHERE AND NOT (node)<-[:CONTAINS_ASSET]-(:Feature)-[:IS_IN_SECTION]->(:Section)<-[:HAS_RESTRICTED_SECTION {restriction_type: "view"}]-(ug) 
  AND NOT (node)<-[:HAS_RESTRICTED_ASSET {restriction_type: "view"}]-(ug)
  AND NOT (node)<-[:CONTAINS_ASSET]-(:Feature)<-[:HAS_RESTRICTED_FEATURE {restriction_type: "view"}]-(ug)

RETURN DISTINCT node 
ORDER BY node.date DESC
SKIP 0
LIMIT 20

We have a few more types of restrictions but here we have the main idea.

Some observations we have made are:

  • If we execute the query part (1) adding return ug after unwind, this query is solved in 1ms
  • If we change the query part (1) to MATCH (ug:Usergroup {uid: $usergroup_uid}) ignoring the parent groups, the query is solved in around 800ms. And if we add back the original part (1) it is solved in 8 seconds even if the Usergroup has no parents.

Currently, our database is small compared to the expected number of nodes (~6 millions), and the number of restrictions will grow, and we need to optimize this kind of queries.

For that, we have these questions:

  • The NOT <restrictions> (ex: NOT (node)<-[:HAS_RESTRICTED_ASSET {restriction_type: "view"}]-(ug)) conditions is correct in this kind of situation or are there other approachs to get the job done more efficiently?
  • Do we need any type of index?
  • Is the structure of the schema right, or are there any inefficiencies?
  • How can we rewrite the part (1) of the query or what do you thinks is causing the overhead with it?

The database version is Neo4j 3.5.X
Captures of the query profile are:



Thanks in advance

Is the logic in query 2 correct? You get a user’s groups in query 1. You then search group-by-group for all the nodes that that one group doesn’t have restrict access to. You then return the union of all these nodes that have been found to be unrestricted by at least one group. Don’t you need to find the nodes that are found to be unrestricted by all groups? In your result you could have a node that is found unrestricted by one group A, included in the result, but later found to be restricted by group B. This assets will be included in the result with the present logic.

Your algorithm needs to interrogate every asset node to find the ones that are not restricted. To do so, every path from that asset needs to be determinedw not to match one of the three patterns. I don’t think performance is going to scale well as the number of assets grows.

Maybe approaching this from the opposite direction would be better. Find the assets that the user is restricted from viewing. This is easy since you have paths to each restricted paths. You then collect the uniques asset identifier for each restricted asset. This should be a fast query. Now you need the asset nodes whose unique identifier is not in this list. This is a simple query and should be fast if you have an index on the property.

The question is how big of a list can cypher handle for a ‘not in’ clause. Oracle has a limit of 1000 elements, we approach that restriction by partitioning large lists into lists of a 1000 elements and ORing the results. What is the number of restricted assets for a typical user?

Hello @ainsausti !

As @glilienfield suggested, retrieving nodes that not supposed to be traversed and then using their identifiers in order to remove them from a general match may be a better approach.

Quick question, do you APOC on your installation?

Bennu

1 Like

@bennu_neo, I feel an efficient apoc solution coming.

@glilienfield Allegedly guilty! :sweat_smile:

Thanks for the responses @bennu_neo and @glilienfield.

You are right, we've just tested this, and it's making the union of the lists. :man_facepalming:

We've considered this approach but we may have built the query incorrectly because all the attempts to face it in this way were even slower.

Restricted directly from a usergroup to an asset through the HAS_RESTRICTED_ASSET relationship there could be tipically less than 1000, but indirectly trhough the other relationships HAS_RESTRICTED_FEATURE, HAS_RESTRICTED_SECTION we could have 100K or more.

Yes we have APOC installed. So any suggestion with this approach will be valid for us

Nice!

So let's play together a little bit. I'll not consider the property condition HAS_RESTRICTED_ASSET for a while so we can first get an idea of the timing like this.

MATCH (u:UserGroup {uid: $usergroup_uid})
WITH u
CALL apoc.path.subgraphNodes(u, {
	relationshipFilter: "HAS_PARENT_GROUP>"
}) yield node
WITH collect(node) AS usergroups
CALL apoc.path.subgraphNodes(usergroups, {
	relationshipFilter: "HAS_RESTRICTED_ASSET>|CONTAINS_ASSET>|HAS_RESTRICTED_FEATURE>|HAS_RESTRICTED_SECTION|<IS_IN_SECTION",
    labelFilter: "/Asset"
}) yield node
with collect(id(node)) as bl
MATCH (node:Asset)
  WHERE id(node) not in bl
RETURN node 
ORDER BY node.date DESC
LIMIT 20

Let see if this runs :smiley:

One way to fix the logic to ensure all user groups don't have restricted access for an Asset node is to use the 'all' predicate. The following should work. I have included this for educational purposes so you can see how to solve the problem. Like @bennu_neo and I stated, your original approach doesn't seem like it will scale well. Try @bennu_neo's query to see how that works.

MATCH path=(:UserGroup {uid: $usergroup_uid})-[:HAS_PARENT_GROUP*0..]->(root:UserGroup)
  WHERE NOT (root)-[:HAS_PARENT_GROUP]->(:UserGroup)
  WITH nodes(path) AS usergroups
  
MATCH (node:Asset)
  WHERE all(ug in usergroups where NOT (node)<-[:CONTAINS_ASSET]-(:Feature)-[:IS_IN_SECTION]->(:Section)<-[:HAS_RESTRICTED_SECTION {restriction_type: "view"}]-(ug) 
  AND NOT (node)<-[:HAS_RESTRICTED_ASSET {restriction_type: "view"}]-(ug)
  AND NOT (node)<-[:CONTAINS_ASSET]-(:Feature)<-[:HAS_RESTRICTED_FEATURE{restriction_type:"view"}]-(ug))

RETURN DISTINCT node 
ORDER BY node.date DESC
SKIP 0
LIMIT 20

Now that I looked back on my query, this should also work:

MATCH (u:UserGroup {uid: $usergroup_uid})
CALL apoc.path.subgraphNodes(u, {
	relationshipFilter: "HAS_PARENT_GROUP>|HAS_RESTRICTED_ASSET>|CONTAINS_ASSET>|HAS_RESTRICTED_FEATURE>|HAS_RESTRICTED_SECTION|<IS_IN_SECTION",
    labelFilter: "/Asset"
}) yield node
with collect(id(node)) as bl
MATCH (node:Asset)
  WHERE id(node) not in bl
RETURN node 
ORDER BY node.date DESC
LIMIT 20

Bennu

1 Like

Thank you!! @bennu_neo and @glilienfield

I have tried the first query proposed by @bennu_neo, and it's incredibly fast (~100ms), I've also tried the one of @glilienfield, but this is still slow (~25s).

The following step is adapt and add the APOC query to our tests and verify that it works as expected.

I expect to do this tomorrow and I'll tell you back the results.

That is excellent. Hey, don't label that as my query. I didn't expect it to work well.

Hi @ainsausti !

Keep in mind that my query should be equivalent to yours without the [:HAS_RESTRICTED_ASSET {restriction_type: "view"}] condition and just [:HAS_RESTRICTED_ASSET ]. Keep that on mind when comparing results. Then if they are they same, we can talk about ways to introduce that condition without loosing our performance.

Oh @glilienfield , don't abandon your child (query)

1 Like