I have an interesting scenario that I would appreciate some feedback on. The general issue is the following: Given a starting node, how can I find a set of endpoints which can be reached by a set of paths, when each path must go through a specific set of other nodes. I'm using Neo4J v4.4 currently.
The diagram has the general setup as follows:
- Reference data nodes (in blue, with labels "Type", "Geo" and "Subject" are each a hierarchy). For example in Geo, the top level would be the entire world, with a child node of "North America", with a child node of "United States", etc.
- Documents are tagged with one or more reference data nodes for each of the reference data types.
- A Filter condition is applied which defined content that a user has access to. Each filter has one or more "ACCESS_TO" relationships for each refence data type. The darker blue nodes are the ones that are considered a match for the Filter condition.
The challenge is that once a Filter is defined, how can I find all the Doc nodes that match that filter condition. In the diagram, the darker blue reference data nodes are the ones that match the Filter definition. From a path perspective, what I'm looking for are all Doc nodes that...
- Can be reached by at least path through the darker blue "Type" nodes.
- AND can be reached by at least one path through the darker blue "Geo" nodes.
- AND can be reach by at least one path through the darker blue "Subject" nodes.
- AND the paths cannot traverse up the hierarchy.
In the diagram above, the query should return only Doc3. Doc1 is not a match due to the Geo tag not being part of the Filter condition. Doc2 is not a match due to the Subject tag not being part of the Filter condition.
This example is a subset of a much larger scenario that has several additional reference data types.
I've got at least a couple of approaches that work but the performance isn't great so hoping someone has a creative approach to the problem that is more performance.
The approaches tried so far are...
Method 1
Get a collection of GUIDs for each of the reference data node related to the filter and doing multiple queries to find the Docs.
Example: (this doesn't include the intial query to get all GUIDs for the filter
MATCH (d:Doc-[:TAGGED]->(t:Type)
WHERE t.GUID IN ['T02','T03','T04',...]
WITH d
MATCH (d)-[:TAGGED]->(g:Geo)
WHERE g.GUID IN ['G02','G03','G05',...]
WITH d
MATCH (d)-[:TAGGED]->(s:Subject)
WHERE s.GUID IN ['S02','S03','S04',...]
RETURN d
Method 2
This uses a path query to find the possible Docs for each path, then uses apoc to get an intersection of the different sets.
Example:
MATCH (f:ClientFilter) WHERE f.GUID="<some ID>"
WITH f
MATCH (f)-[:ACCESS_TO]->(t:Type)<-[:IS_IN*0..5]-(tc:Type)<-[:TAGGED]-(d:Doc)
WITH f,COLLECT(d) as set1
MATCH (f)-[:ACCESS_TO]->(g:Geo)<-[:IS_IN*0..5]-(gc:Geo)<-[:TAGGED]-(d:Doc)
WITH f,set1,COLLECT(d) as set2
MATCH (f)-[:ACCESS_TO]->(s:Subject)<-[:IS_IN*0..5]-(sc:Subject)<-[:TAGGED]-(d:Doc)
WITH f,set1,set2,COLLECT(d) as set3
RETURN apoc.coll.intersection(set1, apoc.coll.intersection(set2,set3)) as Docs
Both of these approaches have similar performance so far, but I'm hoping to find something more efficient. When scaled up to the full scenario, these approaches have response times in the 6-10 second range depend on the filter conditions.