Dear all,
I'm running Neo4j (default configuration) on a modern Linux i7 quad-core with 16GB of RAM.
I'm working with a database that stores information about all artifacts (JARs) on Maven Central. The schema of the database is pretty simple: every node has the label Artifact. Artifacts have a groupId, artifactId and version (all typed String), and are linked by either NEXT (new version of the artifact) or DEPENDS_ON (dependency) relationships. There are 2.3M artifacts, 2.1M NEXT (update) links, and 9.3M dependencies.
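To make the schema concrete, here is a minimal sketch of a tiny sample graph in Cypher. All coordinates are made up for illustration, and the property names follow the description above (the index listing and the query below spell them groupID and artifact instead).

```cypher
// Hypothetical sample data: two versions of a library and two versions of a client.
CREATE (lib1:Artifact {groupId: 'org.example', artifactId: 'lib', version: '1.0'}),
       (lib2:Artifact {groupId: 'org.example', artifactId: 'lib', version: '1.1'}),
       (app1:Artifact {groupId: 'org.example', artifactId: 'app', version: '2.0'}),
       (app2:Artifact {groupId: 'org.example', artifactId: 'app', version: '2.1'}),
       // NEXT points from a version to its successor
       (lib1)-[:NEXT]->(lib2),
       (app1)-[:NEXT]->(app2),
       // DEPENDS_ON points from the client to the library it uses
       (app1)-[:DEPENDS_ON]->(lib1),
       (app2)-[:DEPENDS_ON]->(lib2)
```

On this sample, the lib 1.0 node is the kind of artifact I am trying to find: one client version depends on it, and the next version of that client depends on a later version of the same library.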
The indexes are as follows:
INDEX ON :Artifact(groupID),Unnamed index,[Artifact],[groupID],ONLINE,node_label_property,100.0,"{version:1.0,key:lucene}",5,""
INDEX ON :Artifact(coordinates),Unnamed index,[Artifact],[coordinates],ONLINE,node_unique_property,100.0,"{version:1.0,key:lucene}",1,""
INDEX ON :Exception(name),Unnamed index,[Exception],[name],ONLINE,node_unique_property,100.0,"{version:1.0,key:lucene}",3,""
I am running the following query using the Java driver (session.run()). Its purpose is essentially to find every artifact library1 for which there are two artifacts client1 and client2, where client2 is the next version of client1, such that client1 uses the old version of the library (library1) and client2 uses a newer version of it (library2). After 8+ hours of computing, I still have not had any result streamed back, and Neo4j is still working hard.
MATCH (client1)-[:DEPENDS_ON]->(library1)-[:NEXT*]->(library2)<-[:DEPENDS_ON]-(client2)<-[:NEXT]-(client1)
WITH DISTINCT library1
RETURN library1.groupID, library1.artifact
The query plan is as follows:
Finally, a PROFILE on a simpler query (LIMIT 10000) gives:
PROFILE
MATCH (client1)-[:DEPENDS_ON]->(library1)-[:NEXT*]->(library2)<-[:DEPENDS_ON]-(client2)<-[:NEXT]-(client1)
WITH DISTINCT library1
RETURN library1.groupID, library1.artifact
LIMIT 10000
Unfortunately, I do not see how to optimize this query. I'm looking for all artifacts that match the pattern, so I do not see how to restrict the number of matched nodes. Besides, the query is so simple that I do not see any room for improvement.
The (library1)-[:NEXT*]->(library2) part is quite important, and I cannot put a bound on the number of hops between two versions of a library: there can be any number of them.
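For reference, this is the kind of bounded variant I would like to avoid. It is only a sketch, and the limit of 5 hops is an arbitrary, hypothetical value:

```cypher
// Bounded variable-length pattern (hypothetical upper limit of 5 hops).
// It would miss pairs of versions that are more than 5 NEXT links apart.
MATCH (library1)-[:NEXT*1..5]->(library2)
RETURN count(*)
```

Any fixed upper bound like this changes the result set, which is why I kept the unbounded [:NEXT*] form in the query.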
Any suggestion on how to improve the computation time of this query is very welcome!