Optimizing a very simple (but expensive) query that takes hours to complete

cioran.naroic · August 14, 2019, 4:20pm

Dear all,

I'm running Neo4j (default configuration) on a modern Linux i7 quad-core with 16GB of RAM.

I'm working with a database that stores information about all artifacts (JARs) on Maven Central. The schema of the database is pretty simple: every node is an Artifact (label). Artifacts have a groupId, artifactId and version (typed String), and are linked by either NEXT (new version of the artifact) or DEPENDS_ON (dependency) relations. There are 2,3M artifacts, 2,1M updates links, and 9,3M dependencies.

The indexes are as follows:

INDEX ON :Artifact(groupID),Unnamed index,[Artifact],[groupID],ONLINE,node_label_property,100.0,"{version:1.0,key:lucene}",5,""
INDEX ON :Artifact(coordinates),Unnamed index,[Artifact],[coordinates],ONLINE,node_unique_property,100.0,"{version:1.0,key:lucene}",1,""
INDEX ON :Exception(name),Unnamed index,[Exception],[name],ONLINE,node_unique_property,100.0,"{version:1.0,key:lucene}",3,""

I am running the following query using the Java driver (session.run()). Its purpose is essentially to find every artifact library1 for which there is two artifacts client1 and client2 such that client1 uses the old version of the artifact library1 and client2 uses the new version of the artifact library2. After 8+ hours of computing, I still do not get any result streamed, and Neo4j is still working hard.

MATCH (client1)-[:DEPENDS_ON]->(library1)-[:NEXT*]->(library2)<-[:DEPENDS_ON]-(client2)<-[:NEXT]-(client1)
WITH DISTINCT library1
RETURN library1.groupID, library1.artifact

The query plan is as follows:

Finally, a PROFILE on a simpler query (LIMIT 10000) gives:

PROFILE
MATCH (client1)-[:DEPENDS_ON]->(library1)-[:NEXT*]->(library2)<-[:DEPENDS_ON]-(client2)<-[:NEXT]-(client1)
WITH DISTINCT library1
RETURN library1.groupID, library1.artifact
LIMIT 10000

Unfortunately, I do not see how to optimize this query. I'm looking for all artifacts that match the pattern, so I do not see how to restrict the number of matched nodes. Besides, the query is so simple that I do not see any room for improvement.

The (library1)-[:NEXT*]->(library2) is quite important and I cannot put a bound on the number of hops between two versions of a library, there can be any.

Any suggestion on how to improve the computation time of this query is very welcome!

soham.dhodapkar · August 14, 2019, 5:58pm

Hi,
Good that you profiled the query, that really helps.
Maybe try matching one pattern with distinct library1 first, and then the rest of the pattern
As in
MATCH (client1)-[:DEPENDS_ON]->(library1)
WITH DISTINCT library1 as lib1
..rest of the query..
This will pick distinct library1 first, and then match them with the rest of the pattern reducing the number of patterns to look for (lesser number of expansions).

michael.hunger · August 14, 2019, 7:22pm

Do you have by chance the data available for testing?

I guess you'd want to start from the back and then use a where condition with a shortest path

PROFILE
MATCH (library2)<-[:DEPENDS_ON]-(client2)<-[:NEXT]-(client1)
WITH distinct client1, library2
MATCH (client1)-[:DEPENDS_ON]->(library1)
WHERE library1 <> library2 AND shortestPath( (library1)-[:NEXT*]->(library2) ) is not null
WITH DISTINCT library1
RETURN library1.groupID, library1.artifact
LIMIT 10000

cioran.naroic · August 14, 2019, 8:24pm

The data is available here: https://zenodo.org/record/1489120/files/maven-data.raw.tar.xz. Would be awesome if you could have a look!

@Soham: your trick helps a little but it seems I'm paying similarly the same price again after the first DISTINCT.

@Michael: I see your idea, and indeed it has a nice impact on db hits. But the query takes more time in the end: 21034924 total db hits in 13528 ms (original) vs 3115460 total db hits in 48137 ms (yours).

It seems that I can workaround my issue by removing all DISTINCT and do the job on the Java side. This way, the query is pretty fast to start streaming the result, and I can check for duplicates on my side. When I add the DISTINCT back, I just don't get output in hours. But that doesn't solve the issue "properly".

andrew_bowman · August 14, 2019, 9:35pm

Let's try an alternate approach here. You're only interested in one node within this pattern. We can use a collection (from a pattern comprehension) to gather a set of interested nodes for one of these other nodes in the pattern, and then use some more efficient means to determine if there's at least one var-length relationship connecting your library node to a node in the collection.

The main idea is that we're aiming to cut down on cardinality and do more efficient expansion.

You'll want to leverage APOC Procedures for this.

Give this a try:

PROFILE
MATCH (library)
WHERE ()-[:DEPENDS_ON]->(library) AND (library)-[:NEXT]->()  // this will use fast relationship degree checks
WITH library, [(library2)<-[:DEPENDS_ON]-(client2)<-[:NEXT]-(client1)-[:DEPENDS_ON]->(library) WHERE library2 <> library | library2] as otherLibraries
WITH library, apoc.coll.toSet(otherLibraries) as otherLibraries // make the list distinct
CALL apoc.path.subgraphNodes(library, {relationshipFilter:'NEXT>', endNodes:otherLibraries, limit:1}) YIELD node
WITH library
LIMIT 10000
RETURN library.groupID, library.artifact

cioran.naroic · August 15, 2019, 8:52am

@andrew_bowman Thank you for the suggestion! Just like Michael's query, your solution reduces the number of db hits but takes more time to complete: 21034924 total db hits in 14173 ms (original) vs. 12908177 total db hits in 24683 ms (yours). What are the factors, besides db hits, that might explain these results?

@all: I have to apologize. The reason I did not get any result streamed was external to Neo4j—a deadlock in my own code... The query is still relatively slow, but this is okay for my use case.

Out of curiosity, I'd still be interested in optimizing the query; but otherwise you can consider my problem solved. Sorry, and thank you all!

andrew_bowman · June 11, 2020, 7:02pm

One factor could be the lack of labels. You have MATCH (library), which means this will attempt a scan on all nodes in your db. If you only want to check this for nodes of a certain label, then add the label to the pattern and it will only scan nodes of that label, which may be more efficient.

Topic		Replies	Views
Optimisation of a cypher query Cypher optimization	3	1083	January 29, 2019
Query all the paths to one node with one kind of relationship neo4j Cypher cypher	9	4525	September 9, 2020
How to get query plan and how to create and use an index on id? Cypher	10	485	August 4, 2020
Variable length greater 11 cause my query to run forever Cypher	5	300	October 2, 2021
Cypher query execution is not effectively Cypher performance , cypher	2	320	July 27, 2021

July Summer Fun!

Optimizing a very simple (but expensive) query that takes hours to complete

Related topics