Closest point is slow for millions of point nodes

I'm trying to find the closest point to another point in Neo4j V5 enterprise. The obvious answer is to use the point.distance function, however this doesn't perform how I would like it to.

I'll explain: Using the point.distance function to run a query something like this:

  PROFILE WITH point({longitude:-5.9, latitude:40.7}) AS poi
  MATCH (n:LocationNodes)
  WITH n, point.distance(poi, n.point) as distance
    ORDER BY distance
    LIMIT 1
  RETURN n

However if I look at the execution plan for this query, it calculates the distance from my poi to ALL LocationNodes in my database. If I have 10 million poi nodes then this is a lot of DB hits and isn't performant.

I'm really looking for the index to be used and the index gets the next closest point. I was hoping the index would store points by distance or something like that.

My test example is as follows:

CREATE POINT INDEX LocationNodes_point FOR (node:LocationNodes) ON (node.point);

CREATE (:LocationNodes {point:point({longitude:- 6.1, latitude:40.7})})
CREATE (:LocationNodes {point:point({longitude:- 6.2, latitude:40.8})})
CREATE (:LocationNodes {point:point({longitude:- 6.4, latitude:40.6})})
CREATE (:LocationNodes {point:point({longitude:- 5.9, latitude:40.7})})
CREATE (:LocationNodes {point:point({longitude:- 5.8, latitude:40.8})})
CREATE (:LocationNodes {point:point({longitude:- 5.5, latitude:40.6})})
CREATE (:LocationNodes {point:point({longitude:- 4.1, latitude:40.7})})
CREATE (:LocationNodes {point:point({longitude:- 4.4, latitude:40.8})})
CREATE (:LocationNodes {point:point({longitude:- 4.7, latitude:40.6})})

My above distance query searches all 9 nodes calculating the distance for each. Total DB hits is 30. This is for only 9 point nodes. If I have millions then this is going to be a performance problem.

In contrast if I run following query

PROFILE WITH point({longitude:-6.10, latitude:40.81}) AS poi
MATCH (n:LocationNodes) WHERE point.distance(poi, n.point) <10000
RETURN n 

This hits the index and results in a DB hit of only 4. If I increase the number of nodes in the system my understanding is this will remain at 4 DB hits as long as there is only 1 point in the 10000 meter radius. This is exactly what I would expect. This is a defined number of DB hits for any number of nodes. And this is what I want when calculating the closest node.

Here the index is used to calculate the 1 node within the 10000meter distance. The index obviously knows distance between nodes. How can I use the index to get the closest point? Or is these another way to find the closest point node?

I am real curious how the index can be used to find points within in a box or distance, as the distance between all the other indexed points and the target point changes for each target point.

Anyways, given that the index will be used when the predicate is distance is less than a value, could you trigger the use of the index and limit the number of points searched by using a max distance predicate based on your knowledge of your data set? Can you define a distance value, so that you know at least one point will be within that distance for every point in your data set? If so, set that value to the maxDistance in the following query.

WITH point({longitude:-5.9, latitude:40.7}) AS poi, 1000 as maxDistance
MATCH (n:LocationNodes)
WITH n, point.distance(poi, n.point) as distance
WHERE distance < maxDistance
RETURN n
    ORDER BY distance
    LIMIT 1

In case the above does not recognize the predicate and use the index, maybe try the following refactored form:

WITH point({longitude:-5.9, latitude:40.7}) AS poi, 1000 as maxDistance
MATCH (n:LocationNodes)
WHERE point.distance(poi, n.point) < maxDistance
RETURN n
    ORDER BY point.distance(poi, n.point)
    LIMIT 1

Richard,

You can first filter the nodes by a bounding box to narrow down the set of nodes to compare distances.

Then calculate the distance on the nodes within the bounding box to your start point.

WITH
  point({longitude: -4.53, latitude: -55.66}) AS lowerLeft,
  point({longitude: 1.1, latitude: 55.70}) AS upperRight,
 point({longitude:-6.10, latitude:40.81}) AS poi
MATCH (t:LocationNodes)
WHERE point.withinBBox(t.point, lowerLeft, upperRight)
with t, poi, point.distance(poi, t.point) as distance
return id(t), distance;

Or this:

WITH point({longitude:-6.10, latitude:40.81}) AS poi
MATCH (n:LocationNodes) WHERE point.distance(poi, n.point) <30000
return n, point.distance(poi, n.point) as pt
order by pt asc limit 1 

Hi Gary, thanks however both these queries will potentially hit the DB with a large number of 'distance' calculations which is what I was trying to avoid.
One of the key points here is I don't have or know a max distance. The closest point could be 10 meters away or 100,000 meters away. If I set my max distance to 100,000 then the query will perform the point.distance calculation for all the nodes inside the 100,000 range, potentially millions. When all I really want to do is calculate for the closest node, one node really.

I was looking for solution where I could start with a max distance of say 1,000 meters and 'grow' the max distance if I get zero back from the query. Maybe an apoc while loop or something stopping when I get one result or more.

yes, the bounding box was another thing I was looking at, however again (see reply to Gary) I still need to know the size of the bounding box. With the points I have the closest could be meters away or 1,000's of kilometres away. I really have no idea how close the closest point could be. This makes it difficult.

Unfortunately, I don't know of an APOC procedure that allows the equivalent of a 'while' loop. Cypher does not have any scripting capabilities either.

You could write a simple custom procedure.