Variable length greater 11 cause my query to run forever

mirco.zeiss · September 30, 2021, 2:11pm

Hey neo4j community,

I'm using the Neo4j server version 4.3.2 (community).

I'm trying to find all dependencies for a given software package. In this special case I'm working with the Node.js / JavaScript ecosystem and scraped the whole npm registry. My data model is simple, I've got packages and a package can have multiple versions. A version can have multiple dependencies.

In my database I have 113.339.030 dependency relationships and 19.753.269 versions.

My whole code works fine until I found a package that has so many dependencies (direct and transitive) that all my queries break down. It's called react-scripts.

Here you can see the package information.

https://registry.npmjs.org/react-scripts

One visualizer never finishes

and another one creates a dependency graph so big it's hard to analyze.

My nodes have the properties

version_id : integer
name : string
version : string

I'm starting with what I thought would be a simple query but it's already failing. Start with version that has version_id 16674850 and give me all its dependencies.

MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..11]->(b:Version)
return DISTINCT b;

I have an index on version_id .

CREATE INDEX FOR (version:Version) ON (version.version_id)

That works until I set the depth / variable length to 12 or greater. Then the query runs forever. Here is the query plan.

Neo4j runs inside Docker. I've increased some memory settings.

- NEO4J_dbms_memory_heap_initial__size=2G
- NEO4J_dbms_memory_heap_max__size=2G
- NEO4J_dbms_memory_pagecache_size=1G

I uploaded a sample data set. It's the same data I'm currently using. Here are the links

https://s3.amazonaws.com/blog.spolytics.com/versions.csv (737.1 MB)
https://s3.amazonaws.com/blog.spolytics.com/dependencies.csv (1.7 GB)

Here is the script to import the data.

neo4j-admin import \
    --database=deps \
    --skip-bad-relationships \
    --id-type=INTEGER \
    --nodes=Version=import/versions.csv \
    --relationships=DEPENDS_ON=import/dependencies.csv

That might help to do some experiments on your side and to reproduce my problem.

Any ideas? I'm really lost right now and don't want to give up on my "software dependency analysis graph". I spent the last 6 weeks on this problem. Thank you very much!

andrew_bowman · September 30, 2021, 6:55pm

The plan changes to one that is far less efficient. It may be a little tough to try to force the planner not to do that.

As an alternate approach, if you have APOC Procedures installed, please try using apoc.path.subgraphNodes(), as this should be a more efficient approach to matching to distinct nodes when there is a web of paths that can lead to the same nodes.

MATCH (a:Version {version_id: 16674850})
CALL apoc.path.subgraphNodes(a, {relationshipFilter:'DEPENDS_ON>', labelFilter:'>Version', maxLevel:11}) YIELD node as b
RETURN b

mirco.zeiss · September 30, 2021, 8:06pm

Holy moly! Thank you very much. It works! I'm so glad that I finally found a solution.

Started streaming 1725 records after 3 ms and completed after 28 ms, displaying first 1000 rows.

Here is the query plan.

Unfortunately I have to learn more about apoc now

Maybe one additional question. How can I include the path to each node? I'd like to show my users their dependencies and also why this specific dependency is included in the list. Since there are usually multiple paths to a dependency it would be OK to simply return the shortest path.

andrew_bowman · October 1, 2021, 6:48pm

You can use apoc.path.spanningTree() which yields path instead of node, that will have a single path per end node.

mirco.zeiss · October 1, 2021, 8:46pm

Awesome! Thank you very much. If you want you can also grab the bounty at Stack Overflow cypher - Neo4j - Variable length greater 11 runs forever and query never returns - Stack Overflow.

I would be happy to give to you.

andrew_bowman · October 2, 2021, 12:58am

I'll take you up on that! My username there is InverseFalcon, I've posted my answer.

Topic		Replies	Views
Variable length with intermediate node Cypher	2	287	February 27, 2021
Variable length path traversal Cypher performance , cypher	10	1088	September 27, 2023
Optimizing a very simple (but expensive) query that takes hours to complete Cypher	6	1275	June 11, 2020
Query optimization with relationship depth 1 Cypher	3	1634	August 9, 2019
Neo4j CPU 100% for a simple variable-length path traversal Cypher	5	446	July 3, 2024

July Summer Fun!

Variable length greater 11 cause my query to run forever

Related topics