Hey neo4j community,
I'm using the Neo4j server version 4.3.2 (community).
I'm trying to find all dependencies for a given software package. In this particular case I'm working with the Node.js / JavaScript ecosystem and have scraped the whole npm registry. My data model is simple: there are packages, a package can have multiple versions, and a version can have multiple dependencies.
In my database I have 113,339,030 dependency relationships and 19,753,269 versions.
All my code worked fine until I found a package with so many dependencies (direct and transitive) that all my queries break down. It's called react-scripts.
Here you can see the package information.
https://registry.npmjs.org/react-scripts
One visualizer never finishes, and another one creates a dependency graph so big it's hard to analyze.
My nodes have the properties
- version_id: integer
- name: string
- version: string
I'm starting with what I thought would be a simple query, but it's already failing. Start with the version that has version_id 16674850 and give me all its dependencies.
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..11]->(b:Version)
return DISTINCT b;
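To make clear what result I expect from that query, here is a minimal sketch in Python of the intended semantics: the set of distinct versions reachable from a start version within a maximum number of DEPENDS_ON hops. The function name dependencies_within and the toy graph are hypothetical and only for illustration. They are not my real data, and this is not how Neo4j evaluates the query internally.

```python
from collections import deque


def dependencies_within(graph, start, max_depth):
    """Return the distinct nodes reachable from `start` in 1..max_depth hops.

    `graph` maps a version id to the list of version ids it depends on.
    """
    found = set()                 # every dependency discovered so far
    queue = deque([(start, 0)])   # (node, hops from start)
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue              # do not expand beyond the hop limit
        for dep in graph.get(node, ()):
            if dep not in found:  # visit each dependency only once
                found.add(dep)
                queue.append((dep, depth + 1))
    return found


# Hypothetical toy graph: 1 depends on 2 and 3, both depend on 4, 4 depends on 5.
toy_graph = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}

print(sorted(dependencies_within(toy_graph, 1, 2)))
print(sorted(dependencies_within(toy_graph, 1, 11)))
```

In this sketch each version is expanded once no matter how many different paths lead to it, which is the behavior I was hoping to get from RETURN DISTINCT b.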
I have an index on version_id.
CREATE INDEX FOR (version:Version) ON (version.version_id)
That works until I set the depth / variable length to 12 or greater. Then the query runs forever. Here is the query plan.
Neo4j runs inside Docker. I've increased some memory settings.
- NEO4J_dbms_memory_heap_initial__size=2G
- NEO4J_dbms_memory_heap_max__size=2G
- NEO4J_dbms_memory_pagecache_size=1G
I uploaded a sample data set. It's the same data I'm currently using. Here are the links
- https://s3.amazonaws.com/blog.spolytics.com/versions.csv (737.1 MB)
- https://s3.amazonaws.com/blog.spolytics.com/dependencies.csv (1.7 GB)
Here is the script to import the data.
neo4j-admin import \
--database=deps \
--skip-bad-relationships \
--id-type=INTEGER \
--nodes=Version=import/versions.csv \
--relationships=DEPENDS_ON=import/dependencies.csv
That might help to do some experiments on your side and to reproduce my problem.
Any ideas? I'm really lost right now and don't want to give up on my "software dependency analysis graph". I spent the last 6 weeks on this problem. Thank you very much!