Hi there,
I have a large graph and want to traverse a path between two nodes;however, i am having some issues.
when i use
match path= ({cui:"C0032373"})-[*1..5]->({cui:"C0032371"})
RETURN count(DISTINCT(path)) as number_of_paths, length(path) AS hops
ORDER BY hops;
i get
╒═════════════════╤══════╕
│"number_of_paths"│"hops"│
╞═════════════════╪══════╡
│1 │1 │
├─────────────────┼──────┤
│1 │4 │
├─────────────────┼──────┤
│5 │5 │
└─────────────────┴──────┘
However, If I want to to have 20 as the number of hops, the graph runs for hours with out giving back any results. So, i tried to use path-expansion procedures with apoc but i only get one path with one hop.
match (n), (m)
where n.cui="C0032373" and m.cui="C0032371"
with n, collect(m) as terminatorNodes
CALL apoc.path.spanningTree(n, {
relationshipFilter: "CAUSES_F5>",
minLevel: 1,
maxLevel: 100,
terminatorNodes:terminatorNodes
})
YIELD path
RETURN count(DISTINCT(path)), length(path) AS hops
ORDER BY hops;
╒═══════════════════════╤══════╕
│"count(DISTINCT(path))"│"hops"│
╞═══════════════════════╪══════╡
│1 │1 │
└───────────────────────┴──────┘
How can i count all the possible paths between two nodes and even return the intermediate nodes. i have a pc running 64G RAM and AMD-Ryzon 9 3950X
I would appreciate your input.
Getting and counting all paths between two nodes can be done with the predicate ALL. This StackOverflow post shows a basic example on how to do it.
That being said, you should consider the size of your graph and whether it might make sense to trim down the results a bit. When you said that using 20 hops never finishes running, I am wondering about the size of your graph? How many nodes and relationships do you have? What is the average number of relationships between nodes?
If you have a densely connected graph, you might have a whole lot of paths that have to be considered that involve only very minor changes to the intermediate nodes. So it wouldn't surprise me in that case if your machine got bogged down. In general, you might consider using a solution like this one described here, but you will node that they use [*], which could be problematic in the case where your nodes are all densely connected.
So please let us know some of the stats on your graph and we can go from there.
thank you very much for your response.
using [*] goes through cyclic paths, if the starting nodes is A and the end-node is G,
i would get something like A-G-H-S-G and A-G-M-N-R-L-G. as far as i am concerned these two are a single path because i am interested in a path which terminates at the first G.
as far as Graph size, it had four hundred and fifty Million relationships-72 Types, and it has three hundreds and fifteen million nodes with 50+ node-labels. please take a look a the photo i attached.
thanks for the great advises
i also want to say that i have trimmed it to two hundred thousand nodes and 19 million relationships. trimming it any further might eliminate useful nodes and relationships.
the graph i am working with is a biomedical graph - meaning it is dense by nature.
You can't use apoc.path.spanningTree() for this, as that uses NODE_GLOBAL uniqueness, meaning once visited, a node will not be able to be revisted via a different path. This is why you only got a single one-hop result: once the single-hop path to G was discovered, G has been visited, so no alternate paths to G will be found. This type of uniqueness finds the shortest path to each node encountered.
You will need to change to using the apoc.path.expandConfig() procedure instead. The signatures are the same, so only a single change is needed, an additional config property to set the uniqueness:
NODE_PATH uniqueness means that per-path, nodes cannot repeat. That prevents looping in a path.
With a complex enough graph you may still find this taking a very long time due to the sheer number of possible paths of up to 100 depth that don't loop and don't encounter the terminator nodes. That's a lot of paths to expand.
Also if possible please label n and m, and create indexes on their cui property to speed up those lookups. How many n and how many m nodes are you expecting from that initial MATCH?
thank you very much. however, there is a small detail which is how to return a value of zero if the paths do not exist?
you will notice that for hops 1 and 2, there are no returns as they do not exist. How can include all specified hops and their values; be it zero or not?
The query so far only works with the length of the path, there's nothing there to tell it to consider values others other than the lengths found, or that there exists some kind of range of values to consider.
You would need to collect the info you need as a structure while getting the max hops, then UNWIND the range (so you have 1 row per entry in the range), and do a little bit of work to get the info you need from the structure that corresponds to it.
...
// assume you've done your work and you have the following two variables
WITH number_of_hops, paths
WITH collect ({number_of_hops:number_of_hops, paths:paths}) as rows, max(number_of_hops) as maxHops
UNWIND range(1, maxHops) as number_of_hops
RETURN number_of_hops, [row IN rows WHERE row.number_of_hops = number_of_hops][0].paths as paths
Thank you,
To get the paths and number_of_hops, I must return them first. I define the two variables during the RETURN stage. Adding the code you sent yields..." RETURN can only be used at the end of the query"; I end up having two return commands. If I split the code, then the result is " variables not defined."