How to improve the performance of a euclideanDistance query

Hello Everyone,

I have a question about a euclideanDistance query. I'm building a database that stores information about people's faces. Every Person node contains:

  • the unique identity,
  • the photo id,
  • the face encoding (a float array of 128 values).

After we add a Person node to the database, we run a query that searches the database and creates a SIMILARITY relationship, with the distance stored on it, between the new node and every other node within a distance threshold.

When the database held only a few Person nodes, the query below worked very well. As the number of Person nodes grows, it is getting extremely slow. We currently have 20k people, and more than 300k are waiting to be added. The query performed well up to about 5k people, but it now takes minutes to complete.

The machine has 4 CPU cores and 8GB RAM. I could upgrade it to 16 CPU cores and 64GB RAM, but that would be a very expensive monthly setup for this purpose, and I'm not sure how much performance it would gain. It is a standalone Neo4j 5.17.0 setup with the apoc and gds plugins. The database is 8.4GB, with 20k nodes and 70M relationships.

My questions are:

  • Is there a way to speed up the query?
  • Do you have any other design ideas for the nodes and relationships?

The query is:

CALL apoc.periodic.iterate(
    "MATCH (n:Person), (m:Person)
     WHERE NOT (n)-[:SIMILARITY]-(m)
       AND n.id_primary = $primary
       AND n.id_primary <> m.id_primary
     WITH n, m, gds.similarity.euclideanDistance(n.encoding, m.encoding) AS distance
     WHERE distance <= $threshold
     RETURN n, m, distance",
    "MERGE (n)-[r:SIMILARITY {distance: distance}]-(m)",
    {batchSize: 100, iterateList: true, parallel: true, params: {primary: "x", threshold: 0.45}}
)

The primary parameter is the unique identifier of the person. The threshold parameter is always the same.

I've already created an index on id_primary.

To be honest, I'm well aware that this kind of query has to scan the whole database to create the relationships. I'm just curious whether anyone has suggestions that address my questions.

We resolved the problem by giving up on creating relationships between the nodes.

Instead, we created a vector index on the encoding property, run the search with the db.index.vector.queryNodes procedure, and recalculate the Euclidean distance for the highest-scoring matches. That is not exactly what we wanted to achieve with the previous approach, but the query results are good enough to be usable.
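
For anyone who lands here with the same problem, the approach can be sketched roughly as below. This is a minimal sketch, not our exact production queries: the index name person_encoding and the neighbour count of 50 are hypothetical values chosen for illustration, and the 'euclidean' similarity function is an assumption that matches the distance we recalculate. The 128 dimensions, the id_primary property and the $primary and $threshold parameters come from the question above.

```cypher
// One-time setup: vector index on the 128-value face encoding.
// Assumption: the index name `person_encoding` is hypothetical.
CREATE VECTOR INDEX person_encoding IF NOT EXISTS
FOR (p:Person)
ON p.encoding
OPTIONS { indexConfig: {
  `vector.dimensions`: 128,
  `vector.similarity_function`: 'euclidean'
}};

// Per-person lookup: fetch the nearest neighbours from the index,
// then recompute the exact Euclidean distance and apply the threshold.
// Assumption: 50 neighbours is an illustrative value, tune it to your data.
MATCH (n:Person {id_primary: $primary})
CALL db.index.vector.queryNodes('person_encoding', 50, n.encoding)
YIELD node AS m, score
WHERE m.id_primary <> n.id_primary
WITH n, m, gds.similarity.euclideanDistance(n.encoding, m.encoding) AS distance
WHERE distance <= $threshold
RETURN m.id_primary AS match, distance
ORDER BY distance;
```

The vector index performs an approximate nearest-neighbour search and returns at most the requested number of candidates, so a person with more matches under the threshold than the neighbour count will have some of them cut off. That is the trade-off against the original all-pairs comparison.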