Spatial index and bounding box query - slow and lots of db hits

performance
cypher
index

(Tony Panza) #1

I am trying to run a spatial bounding box search with the use of a spatial index.

In my graph, there are approximately 1.4 million nodes with a label of :ItemUsageInstance (there are other nodes beyond those as well). These :ItemUsageInstance nodes have a uniqueness constraint called instance_id. There are also 3 point properties on these node labels that are indexes: volume_min_coord, volume_centroid, volume_max_coord.

What I am trying to do is, given a particular :ItemUsageInstance node, find the closest other :ItemUsageInstances whose volume_centroid is within a certain bounding box of the first ItemUsageInstance. While my query "works", it takes ~12 seconds and has 4359092 db hits. It does not seems like I am fully realizing the benefit of the spatial indexing.

Query so far:

MATCH (i:ItemUsageInstance {instance_id: '000000000123456'})
WITH i, 
i.volume_min_coord.x + 50 AS upper_x, i.volume_max_coord.x - 50 AS lower_x,
i.volume_min_coord.y + 50 AS upper_y, i.volume_max_coord.y - 50 AS lower_y,
i.volume_min_coord.z + 50 AS upper_z, i.volume_max_coord.z - 50 AS lower_z
MATCH (other:ItemUsageInstance)
WHERE id(i) <> id(other)
AND point({x: lower_x, y: lower_y, z: lower_z}) < other.volume_centroid < point({x: upper_x, y: upper_y, z: upper_z})
RETURN distance(i.volume_centroid, other.volume_centroid) AS dist, other.instance_id
ORDER BY dist LIMIT 25;

If I PROFILE the query, here is what I get:

That said, creating the index on the volume_centroid property did seem to provide some modest improvement. I had initially forgotten to do so, and without the volume_centroid index, there were 5805244 total db hits in 12188 ms. After the volume_centroid index was ONLINE, the db hits reduced to: 4359092 total db hits in 12546 ms

Maybe my expectations are not realistic, but it seems like it ought to be doing much better. What can I do to improve the performance?

Result from :schema:

Indexes
   ON :ItemUsageInstance(volume_centroid) ONLINE 
   ON :ItemUsageInstance(volume_max_coord) ONLINE 
   ON :ItemUsageInstance(volume_min_coord) ONLINE 
   ON :ItemUsageInstance(instance_id) ONLINE  (for uniqueness constraint)
   ON :Item(item_id) ONLINE  (for uniqueness constraint)
   ON :Material(material_id) ONLINE  (for uniqueness constraint)

Constraints
   ON ( item:Item ) ASSERT item.item_id IS UNIQUE
   ON ( itemusageinstance:ItemUsageInstance ) ASSERT ItemUsageInstance.instance_id IS UNIQUE
   ON ( material:Material ) ASSERT material.material_id IS UNIQUE

If I drop the LIMIT 25 from the query, there are: 2994 rows available after 12209 ms, consumed after another 8 ms

Version: 3.4.5
Edition: community


(Craig Taverner) #3

Thanks for reporting this. We also recently noticed this issue and made a fix for it, but unfortunately the fix was broader in scope than just the spatial index and as a result will only come out in Neo4j 3.5. This is currently in beta, so you could try and see if it works for your case.

The issue was that the query planner would not plan an index seek if the predicate expression included deeper dependencies, and in the case of the spatial bounding box query this meant it would only plan the index if the max and min expressions were literal point() functions with no dependencies. In your query you are depending on previously defined lower_x etc. This is a reasonable thing to do, of course, and so we have enhanced the planner to deal with this case.

The kinds of queries for which we can now plan the index seek include ones like this one in our acceptance tests: https://github.com/neo4j/neo4j/blob/3.5/enterprise/cypher/acceptance-spec-suite/src/test/scala/org/neo4j/internal/cypher/acceptance/SpatialIndexResultsAcceptanceTest.scala#L706

One curious aspect of this issue is that the index planning for the distance function actually worked a lot better already in Neo4j 3.4. This means that you can still get Cypher to plan the index in Neo4j 3.4 if you model a circle around your bounding box and use the distance function in the predicate instead.

So your options for improving performance are:

  • Try Neo4j 3.5 Beta 02 release which should plan your current query with the spatial index
  • Change the query to use the distance function instead to get Neo4j 3.4 to use the spatial index

For this second option you could calculate a distance threshold as the maximum of the distances between the centroid and the max and min points, and then use that in a predicate like:

distance(i.volume_centroid, other_volume_centroid) < dist_threshold

Note: The fix was only merged recently, and did not make the Neo4j 3.5 Beta01 release, but should be in the Neo4j 3.5 Beta02 release.