Finding nodes by multiple required paths

I have an interesting scenario that I would appreciate some feedback on. The general issue is the following: Given a starting node, how can I find a set of endpoints which can be reached by a set of paths, when each path must go through a specific set of other nodes. I'm using Neo4J v4.4 currently.

The diagram has the general setup as follows:

  • Reference data nodes (in blue, with labels "Type", "Geo" and "Subject" are each a hierarchy). For example in Geo, the top level would be the entire world, with a child node of "North America", with a child node of "United States", etc.
  • Documents are tagged with one or more reference data nodes for each of the reference data types.
  • A Filter condition is applied which defined content that a user has access to. Each filter has one or more "ACCESS_TO" relationships for each refence data type. The darker blue nodes are the ones that are considered a match for the Filter condition.

The challenge is that once a Filter is defined, how can I find all the Doc nodes that match that filter condition. In the diagram, the darker blue reference data nodes are the ones that match the Filter definition. From a path perspective, what I'm looking for are all Doc nodes that...

  1. Can be reached by at least path through the darker blue "Type" nodes.
  2. AND can be reached by at least one path through the darker blue "Geo" nodes.
  3. AND can be reach by at least one path through the darker blue "Subject" nodes.
  4. AND the paths cannot traverse up the hierarchy.

In the diagram above, the query should return only Doc3. Doc1 is not a match due to the Geo tag not being part of the Filter condition. Doc2 is not a match due to the Subject tag not being part of the Filter condition.

This example is a subset of a much larger scenario that has several additional reference data types.

I've got at least a couple of approaches that work but the performance isn't great so hoping someone has a creative approach to the problem that is more performance.

The approaches tried so far are...
Method 1
Get a collection of GUIDs for each of the reference data node related to the filter and doing multiple queries to find the Docs.

Example: (this doesn't include the intial query to get all GUIDs for the filter

MATCH (d:Doc-[:TAGGED]->(t:Type) 
WHERE t.GUID IN ['T02','T03','T04',...]
WITH d
MATCH (d)-[:TAGGED]->(g:Geo) 
WHERE g.GUID IN ['G02','G03','G05',...]
WITH d
MATCH (d)-[:TAGGED]->(s:Subject) 
WHERE s.GUID IN ['S02','S03','S04',...]
RETURN d

Method 2
This uses a path query to find the possible Docs for each path, then uses apoc to get an intersection of the different sets.

Example:

MATCH (f:ClientFilter) WHERE f.GUID="<some ID>"
WITH f
    MATCH (f)-[:ACCESS_TO]->(t:Type)<-[:IS_IN*0..5]-(tc:Type)<-[:TAGGED]-(d:Doc)
WITH f,COLLECT(d) as set1
    MATCH (f)-[:ACCESS_TO]->(g:Geo)<-[:IS_IN*0..5]-(gc:Geo)<-[:TAGGED]-(d:Doc)
WITH f,set1,COLLECT(d) as set2
    MATCH (f)-[:ACCESS_TO]->(s:Subject)<-[:IS_IN*0..5]-(sc:Subject)<-[:TAGGED]-(d:Doc)
WITH f,set1,set2,COLLECT(d) as set3
RETURN apoc.coll.intersection(set1, apoc.coll.intersection(set2,set3)) as Docs

Both of these approaches have similar performance so far, but I'm hoping to find something more efficient. When scaled up to the full scenario, these approaches have response times in the 6-10 second range depend on the filter conditions.

Try this instead:

MATCH (d:Doc)
WHERE EXISTS {
    MATCH (d)-[:TAGGED]->(t:Type) 
    WHERE t.GUID IN ['T02','T03','T04',...]
}
AND EXISTS {
    MATCH (d)-[:TAGGED]->(g:Geo) 
    WHERE g.GUID IN ['G02','G03','G05',...]
}
AND EXISTS {  
    MATCH (d)-[:TAGGED]->(s:Subject) 
    WHERE s.GUID IN ['S02','S03','S04',...]
}
RETURN d

You should also order the predicates so that the group with the smaller number of nodes is evaluated first, if there is a noticeable difference in the cardinality of the Type, Geo, and Subject nodes. This would help the query fail faster if the 'd' node is not fully connected.