Following up on my previous question, I'm looking for a way to rewrite an APOC query in pure Cypher.
At this stage my main query looks like this:
MATCH (a:A {id: $id})
CALL apoc.path.subgraphNodes(
  a,
  {
    filterStartNode: true,
    maxLevel: -1,
    labelFilter: ">A|L1|L2|L3|L4|-Excluded",
    relationshipFilter: "USED_L1|USED_L2|USED_L3|USED_L4",
    limit: 15000
  }
)
YIELD node
RETURN node.id
This query works pretty well and returns a response in milliseconds. However, it has a big limitation: I can't, for example, filter by a relationship property.
So I started rewriting it in Cypher and got a more or less working version that supports filtering by properties, but I have huge problems with its performance.
MATCH (a:A {id: 12})
MATCH (a)
  (
    (a_i:A&!Excluded)-
      [r_in:USED_L1|USED_L2|USED_L3|USED_L4
        WHERE r_in.excluded IS NULL OR r_in.excluded <> true]
      ->(n:(L1&!Excluded)|(L2&!Excluded)|(L3&!Excluded)|(L4&!Excluded))
      <-[r_out:USED_L1|USED_L2|USED_L3|USED_L4
        WHERE r_out.excluded IS NULL OR r_out.excluded <> true]
      -(a_j:A&!Excluded)
  )* (a_end:A&!Excluded)
RETURN DISTINCT a_end.id LIMIT 15000;
This query resolves all the filtering limitations I have with APOC, but it performs very badly. It works on medium-sized clusters if I limit the hops to 3, but from hop 4 onward it becomes unusable.
I tried playing with the WHERE COUNT {} syntax to reduce duplication, but it seems to make things even worse in this case.
My data model looks like this:
Each A node is related to L(1-4) nodes, and through one or more of those L(1-4) nodes it is related to other A nodes. For example:
Aa --> L1 <-- Ab --> L2 <-- Ac
   \-> L2 <-/    \-> L3 <-/
The main goal of the query is to get all the A nodes connected to the current A node.
You've got a lot going on here. The property predicates will slow execution, since the properties have to be retrieved for every relationship and node visited. Label and type predicates will be faster. How about using a label like A_Include instead of Excluded, so that your label predicate A&!Excluded can be replaced with A_Include? You can do the same for the other labels that you want to include or exclude. The same is true for the relationship types, for example USED_L1 and USED_L1_EXCLUDE. This way you can eliminate the relationship property excluded and the corresponding predicates.
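As a rough sketch of the relationship-type idea above, relationships already flagged with excluded could be moved to a dedicated type. This assumes excluded is stored as a boolean on the relationship; USED_L1_EXCLUDE is the illustrative name from the suggestion, and the statement would have to be repeated for USED_L2 through USED_L4.

```cypher
// Assumption: `excluded` is a boolean property on USED_L1 relationships.
// USED_L1_EXCLUDE is the illustrative type name from the suggestion above.
MATCH (a:A)-[r:USED_L1]->(l:L1)
WHERE r.excluded = true
// Recreate the relationship under the "excluded" type, keeping its other properties
CREATE (a)-[r2:USED_L1_EXCLUDE]->(l)
SET r2 = properties(r)
REMOVE r2.excluded
// Drop the original so traversals over USED_L1 no longer see it
DELETE r;
```

After such a migration, a traversal that lists only USED_L1|USED_L2|USED_L3|USED_L4 skips the excluded relationships without reading any property. On a large graph it would probably make sense to run this in batches.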
I do wonder about your data model though. It makes your queries complicated.
@glilienfield, thanks for your answer.
Yes, that's a possible approach. If I use relationship types like USED_L1 and USED_L1_EXCLUDE, I can keep using the apoc.path.subgraphNodes function, which works pretty efficiently with the existing label system.
So am I right that I shouldn't design the data model to filter graph traversal by relationship properties, as it will be inefficient? (I have another future use case that filters on a timestamp on the relationship, so it looks like that won't be possible.)
I can't share my data model completely, but in a nutshell the process is as follows:
An event creates a node A in the database.
It checks whether nodes L1, L2, L3 and so on are already in the database.
If they are, it creates relationships between the newly added A and the L1..n nodes.
Otherwise it creates the L1..n nodes and then creates the relationships.
The same process repeats all the time.
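The steps above could be sketched roughly as follows for a single L1 node. This is only an illustration: $id and $l1Id are hypothetical parameters, and it assumes L nodes are identified by an id property, which the description does not state.

```cypher
// Assumptions: $id and $l1Id are hypothetical parameters;
// L1 nodes are assumed to be keyed by an `id` property.
CREATE (a:A {id: $id})
// MERGE reuses the L1 node if it already exists, otherwise creates it
MERGE (l1:L1 {id: $l1Id})
// Link the newly added A node to the L1 node
CREATE (a)-[:USED_L1]->(l1);
```

The same pattern would repeat for L2, L3 and L4 with their own relationship types.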
Another process executes queries to get all the A nodes related to the just-added A node, in order to run some logic on the resulting cluster.
However, there can be false-positive relationships between An and L1..4, and I want to exclude them from the clusters. So far this is done at the node level only, where we have (L1-4:Excluded), but I'd like to be able to exclude at the relationship level (since a relationship could be a false positive for A1 but a true positive for An). Finally, I don't want to remove the relationship completely, so as to keep the original picture; I only want to mark it as excluded.
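For illustration, marking a single relationship as excluded without deleting it might look like the sketch below. $aId and $lId are hypothetical parameters, and the id property on L1 nodes is an assumption.

```cypher
// Assumptions: $aId and $lId are hypothetical parameters identifying the two endpoints;
// L1 nodes are assumed to carry an `id` property.
MATCH (:A {id: $aId})-[r:USED_L1]->(:L1 {id: $lId})
// Keep the relationship, but flag it so the traversal predicate can skip it
SET r.excluded = true;
```

This keeps the relationship in place for A1 while leaving the relationships from other A nodes to the same L1 node untouched.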