Slow cypher

Hi there!
i am new to neo4j and having hard time counting the path between two nodes.
i want to count the path between A and Z where the frequency of each relationship is greater than two.
i tried this code but it is not returning anything; it keeps running and not yield and result, it runs indefinitely.

match (n:Concept),(m:Concept)
where toUpper (n.name) contains "TUBERCULOSIS"
AND toUpper (m.name) contains "CHEST PAIN"
with n,m
match path=(n)-[:CAUSES*]->(m)
where all (a in relationships(path) where a.freq>2)
RETURN count(distinct (path)) as distinct_paths

I will be grateful for your help.
side note: the graph is large with 350 million nodes and 450 million relationships.
Also, anything higher than [:CAUSES*1..2] runs indefinitely and no result is shown.

A few questions for you to better analyze your issue :

  • Do you have indexes on Concept.name ? Important : using a toUpper() will negate the use of an index, as indexing is case sensitive. Consider using a full-text index, or storing your string in the format that you'll want to query it.
  • Do you have the possibility to use STARTS WITH instead of CONTAINS for your string filter ?
  • Can you execute the same query with a "PROFILE" added before the query (or EXPLAIN if your query is really too long or never returns) and provide a screenshot of the returned execution graph ?

Other topic : your query builds a cartesian product on the first row, meaning that if you get n Tuberculosis nodes, and m ChestPain nodes, you are passing along n*m nodes to the rest of the execution.
Consider writing it that way :

MATCH path=(n:Concept)-[:CAUSES*]->(m:Concept)
WHERE n.name CONTAINS "xxx" AND m.name CONTAINS "yyy" AND all(a in relationships(path) where a.freq>2)

thank you for you response.
no there are no indexes,
if i use
(n)-[:causes]->(n1)-[:causes]->(n3)
i get a return count,
but if i do
**(n)-[:causes]->(n1)-[:causes]->(n3)-[:causes]->(n4), it runs indefinitely
as far as starts with, it still has the same issue.
writing it the way you suggested return value of zero even for the positive examples that i know for sure are not zero. Ithink that the names are a mix of upper and lower cases and for that reason i used toUpper, to eliminate that issue.
keep in mind that the node label is Concept and property key is name.


this is how it runs forever.
Screenshot from 2021-04-20 11-32-16

You have several potential issues here.

  • The first one was, filtering on a non-indexed property (name for Concept label). I understand that this property is not normalized (it could be Tuberculosis, as well as TUBERCULOSIS depending on nodes). For this, I would suggest storing it in a normalized way, for instance toLower. This way, you can index it and then query it with the same norm (all lowercase).
  • The cartesian product, which is fixed by moving the filtering on n & m inside the path declaration instead of doing a MATCH separated by a comma

These two issues do not relate to the issue you have with your 2+-length paths, but you should still consider fixing them as it will become an issue sooner or later.

Now, for your issue with paths. Could you execute the following query and provide back the execution plan (see example picture below the query) ? This will help understand what is happening.
Please note that I left the toUpper(m.name) even though you should change that to benefit from indexing. I'll just leave that for now so you don't have to make changes to your property and indexes right now.

EXPLAIN
MATCH path=(n:Concept)-[:CAUSES*]->(m:Concept)
WHERE toUpper(n.name) CONTAINS "TUBERCULOSIS" AND toUpper(m.name) CONTAINS "CHEST PAIN" AND all(a in relationships(path) where a.freq>2)
RETURN count(distinct (path)) as distinct_paths

This is the plan

You should enforce the indexes, i.e. store your indexed names as uppercase so you don't need to do the function and it uses the index, and can start from both sides for the variable path length.
it's also better to use concrete matching entries which is faster than contains (or use medical identifiers instead of names)

Adding a max limit to the path will also help

Unfortunately accessing and checking rel-properties on a lot of relationships takes time.
One thing you can do in your model is to create those relationships then your filter will be faster:

MATCH (a:Concept)-[:CAUSES]->(b:Concept) WHERE r.freq > 2
CREATE (a)-[:FREQUENT]->(b)

and then use that FREQUENT relationship in your path query.

match path=(n:Concept)-[:FREQUENT*]->(m:Concept)
where n.name contains "TUBERCULOSIS" AND m.name contains "CHEST PAIN"
RETURN count(*) as distinct_paths

Thank you very much for your help, I am still having difficulties with it.
it is still not enough to traverse the paths
hardware wise, I am using 64GB RAM and ryzon 16-core CPU, I though it would help to move away from intel but I am still having trouble with it.

With neo4j 4.3 you can create a relationship property index which should speed up your query quite a bit.

https://neo4j.com/docs/cypher-manual/current/administration/indexes-for-search-performance/#administration-indexes-create-a-composite-index-for-relationships