Variable-length path traversal

I am modelling git commits in Neo4j using the community edition (v4.4.11).

A core use-case is to pull the commit history for a particular branch, traversing the (:Commit)-[:PARENT*..]->(:Commit) relationship until there are no parents. Some queries have early stopping criteria (e.g. commit timestamp or encountering a particular commit SHA).

For example:

MATCH (b:Branch)-[:HEAD]->(:Commit)-[:PARENT*0..500]->(c:Commit)
WHERE b.id = $branch_id
AND c.timestamp > datetime($start_date)
RETURN DISTINCT c.sha

I find this query performs very poorly, even though there might only be several hundred commits in the path (but not thousands). Is there a way to make this return quickly? Some thoughts I had were using APOC functions or maintaining a projection for specific branches, but I have limited experience with those features.
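For the APOC idea, this is roughly what I had in mind (untested sketch; assumes the APOC plugin is installed):

MATCH (b:Branch {id: $branch_id})-[:HEAD]->(head:Commit)
// subgraphNodes visits each node at most once, so commits reachable
// via several merge paths are only expanded a single time
CALL apoc.path.subgraphNodes(head, {
  relationshipFilter: 'PARENT>',
  maxLevel: 500
}) YIELD node
RETURN node.sha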

Any advice is welcome, thank you.

Hi Flex,

There's a new capability introduced in v5.9 named QPP (quantified path patterns). Can you upgrade to v5 by any chance? Otherwise you may need to look at the Traversal API.
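For reference, your query would look roughly like this with QPP (untested sketch, v5.9+ syntax):

MATCH (b:Branch {id: $branch_id})-[:HEAD]->(:Commit) (()-[:PARENT]->()){0,500} (c:Commit)
WHERE c.timestamp > datetime($start_date)
RETURN DISTINCT c.sha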

I've tried a similar query on version 4.4.11 with a similar dataset (all git commits in the Neo4j repository) and don't find any performance issues. That DISTINCT in the RETURN clause should lead to the VarLengthExpand(Pruning) operator being used, which won't needlessly re-traverse paths it has already seen.

Could you run EXPLAIN on your query and post the plan? It might offer some clues.

Hi @bennu_neo, thank you for your reply! Yes, upgrading is a possibility if it will solve the problem. The QPP Cypher addition looks good, but it seems to be a syntax simplification rather than a performance improvement. Do you mind explaining how it improves performance?

This helped me understand what the problem was. My actual query did a MATCH before the DISTINCT, so I was doing a lot of unnecessary matching.
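Simplified, the shape of the problem was something like this (the AUTHORED relationship here is just a hypothetical stand-in for my second MATCH):

// Before: the second MATCH runs once per duplicate path
MATCH (:Branch {id: $branch_id})-[:HEAD]->(:Commit)-[:PARENT*0..500]->(c:Commit)
MATCH (c)<-[:AUTHORED]-(a:Author)
RETURN DISTINCT c.sha, a.name

// After: de-duplicate commits first, then match once per commit
MATCH (:Branch {id: $branch_id})-[:HEAD]->(:Commit)-[:PARENT*0..500]->(c:Commit)
WITH DISTINCT c
MATCH (c)<-[:AUTHORED]-(a:Author)
RETURN c.sha, a.name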

Thanks.

Hi @finbar.good, I was a little too optimistic when I thought my problem was solved. I believe it was the cache improving performance. After a restart I can't even get the query on the largest repo to finish.

Restricting the depth to 10 still took minutes; I can't understand why.

Here's the query and its PROFILE output.

I don't know if it's a clue, but the query below returned 99 nodes with paths longer than 20, even though I thought it should only be possible to return paths of length 10.

Again, any advice is very welcome. Thanks.

PROFILE WITH $branch_id AS branch_id
MATCH (branch:Branch {id: branch_id})-[:HEAD]-(:Commit)-[PARENT*0..10]->(commit:Commit)
RETURN DISTINCT commit

It's performing poorly because of a typo in your query: you have written the relationship type PARENT without the colon introducer, which is the same as declaring a variable called PARENT. To the planner, that means it needs to check that none of the relationships in the variable-length portion was already visited in the -[:HEAD]- portion of the solution. To do that, it adds this filter:

none(anon_1 IN PARENT WHERE anon_0 = anon_1) AND commit:Commit

That prevents the DISTINCT from being pulled up into the VarLengthExpand operator, so the exhaustive VarLengthExpand(All) is used instead of VarLengthExpand(Pruning).

So try this instead:

PROFILE WITH $branch_id AS branch_id
MATCH (branch:Branch {id: branch_id})-[:HEAD]-(:Commit)-[:PARENT*0..10]->(commit:Commit)
RETURN DISTINCT commit

Having said that, I don't quite know why the number of rows explodes so dramatically when you impose the upper limit of 10 hops on the variable-length relationship. In my data model, each commit has one or two parents and there are no cycles, i.e. PARENT always points back in time. Is that true of your model?

That's embarrassing, and it explains why it got even worse when I was trying to generate the profile. However, that's not how the code looked in production, and when I fix it, it still won't return in a reasonable time for length 50.

Thanks for catching that. I had actually hoped it would be as simple as that, but I don't think it is.

Attaching a new profile based on depth 40.

This query returns 202 nodes. I didn't check the max depth, but I'm fairly certain it's more than 21. Also, yes: it is impossible for a node to have a parent from the future, so there are no circular paths. There are some merge commits with up to 4 parents, but most commits have 1 parent, and 2 is common.

Thanks again.

PROFILE WITH $branch_id AS branch_id
MATCH (branch:Branch {id: branch_id})-[:HEAD]-(:Commit)-[:PARENT*0..40]->(commit:Commit)
RETURN DISTINCT commit
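In case it's useful, this is the kind of query I can run to check the actual depth distribution (untested sketch):

MATCH p = (:Branch {id: $branch_id})-[:HEAD]-(:Commit)-[:PARENT*0..40]->(commit:Commit)
// length(p) counts relationships, so subtract the HEAD hop
RETURN length(p) - 1 AS depth, count(DISTINCT commit) AS commits
ORDER BY depth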

I also checked the logs while it was attempting depth 50, and there were no memory errors or anything else logged.

Not sure if it helps but here's the system diagnostics from the top of the log.


                          [ System diagnostics ]                             


                      [ System memory information ]                          

Total Physical memory: 16.00GiB
Free Physical memory: 7.246GiB
Committed virtual memory: 11.18GiB
Total swap space: 0B
Free swap space: 0B


                        [ JVM memory information ]                           

Free memory: 7.968GiB
Total memory: 8.000GiB
Max memory: 8.000GiB
Garbage Collector: G1 Young Generation: [G1 Eden Space, G1 Survivor Space, G1 Old Gen]
Garbage Collector: G1 Old Generation: [G1 Eden Space, G1 Survivor Space, G1 Old Gen]
Memory Pool: CodeHeap 'non-nmethods' (Non-heap memory): committed=2.438MiB, used=1.232MiB, max=5.563MiB, threshold=0B
Memory Pool: Metaspace (Non-heap memory): committed=21.50MiB, used=20.71MiB, max=-1B, threshold=0B
Memory Pool: CodeHeap 'profiled nmethods' (Non-heap memory): committed=5.063MiB, used=5.010MiB, max=117.2MiB, threshold=0B
Memory Pool: Compressed Class Space (Non-heap memory): committed=2.824MiB, used=2.479MiB, max=1.000GiB, threshold=0B
Memory Pool: G1 Eden Space (Heap memory): committed=408.0MiB, used=4.000MiB, max=-1B, threshold=?
Memory Pool: G1 Old Gen (Heap memory): committed=7.578GiB, used=0B, max=8.000GiB, threshold=0B
Memory Pool: G1 Survivor Space (Heap memory): committed=24.00MiB, used=24.00MiB, max=-1B, threshold=?
Memory Pool: CodeHeap 'non-profiled nmethods' (Non-heap memory): committed=2.438MiB, used=1.496MiB, max=117.2MiB, threshold=0B


                    [ Operating system information ]                        

Operating System: Linux; version: 5.10.102.2-microsoft-standard; arch: amd64; cpus: 4
Max number of file descriptors: 1048576
Number of open file descriptors: 176
Process id: 19552
Byte order: LITTLE_ENDIAN
Local timezone: Etc/UTC


                          [ JVM information ]                               

VM Name: OpenJDK 64-Bit Server VM
VM Vendor: Eclipse Adoptium
VM Version: 11.0.16.1+1
JIT compiler: HotSpot 64-Bit Tiered Compilers
VM Arguments: [-Xms8388608k, -Xmx8388608k, -XX:+UseG1GC, -XX:-OmitStackTraceInFastThrow, -XX:+AlwaysPreTouch, -XX:+UnlockExperimentalVMOptions, -XX:+TrustFinalNonStaticFields, -XX:+DisableExplicitGC, -XX:MaxInlineLevel=15, -XX:-UseBiasedLocking, -Djdk.nio.maxCachedBufferSize=262144, -Dio.netty.tryReflectionSetAccessible=true, -Djdk.tls.ephemeralDHKeySize=2048, -Djdk.tls.rejectClientInitiatedRenegotiation=true, -XX:FlightRecorderOptions=stackdepth=256, -XX:+UnlockDiagnosticVMOptions, -XX:+DebugNonSafepoints, -Dlog4j2.disable.jmx=true, -Dfile.encoding=UTF-8]

Your plan looks efficient now. And 202 rows for a depth of 40 should return pretty quickly, shouldn't it?

So I am wondering: what do you consider an acceptable time for a depth of 50 to return? How long does it take in practice and how many rows are returned?

Thanks for taking a look at the profile.

It takes minutes to complete. I am looking for sub-second results.

Today I started doing the traversal in our code base instead: we pull all commits and their parents in the time window, then traverse from the target head in JavaScript. It takes less than 1 second to process 18 repos with around 5,000 commits in the specified time period.
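The extraction query is roughly this (a simplified sketch; property names as in my model):

// Pull every commit in the window plus its parent SHAs; the
// traversal itself then happens in application code
MATCH (c:Commit)
WHERE c.timestamp > datetime($start_date)
OPTIONAL MATCH (c)-[:PARENT]->(p:Commit)
RETURN c.sha AS sha, collect(p.sha) AS parents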

One learning from doing this was that, while a commit cannot have a parent from the future, there are nodes with multiple children and nodes with multiple parents. When processing, we encounter the same commit multiple times, which meant the path traversal didn't finish at first. Now that we only traverse the parents of a commit once, it finishes very fast. I can only assume the same issue affects the query on Neo4j (maybe).

I am happy that we have a fix in our application, but I'm a bit frustrated that it wasn't handled by the database. Relationship traversal is the reason we picked a graph database, so it would be nice to know whether this can still be done on the db side.
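If anyone finds this later: the equivalent of our visited-set traversal looks like it can be expressed with APOC's uniqueness setting (untested sketch; assumes the APOC plugin is installed):

MATCH (:Branch {id: $branch_id})-[:HEAD]->(head:Commit)
CALL apoc.path.expandConfig(head, {
  relationshipFilter: 'PARENT>',
  uniqueness: 'NODE_GLOBAL',  // expand each commit at most once
  maxLevel: 50
}) YIELD path
RETURN DISTINCT last(nodes(path)).sha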

Thanks again for your input to date.