How to increase the performance of euclideanDistance query

freakmaxi · March 7, 2024, 10:55pm

Hello Everyone,

I have a question about the euclideanDistance query. I'm building a database that stores information about people's faces. Every node holds one person's information:

the unique identity,
the photo id,
the face encoding, a float array of length 128.

After we add a Person node to the database, we run a query that searches the database and creates a SIMILARITY relationship, carrying the distance, between the new node and every node within the threshold.

While the database held only a few Person nodes, the query below worked very well. Now that the database holds many Person nodes, the query has become extremely slow. We currently have 20k people, with more than 300k waiting to be added. It worked well up to 5k people, but it now takes minutes to complete.

The machine has 4 CPU cores and 8 GB of RAM. I could upgrade it to 16 CPU cores and 64 GB of RAM, but that would be a very expensive monthly setup for this purpose, and I'm not sure how much performance it would buy. It is a standalone Neo4j 5.17.0 setup with the apoc and gds plugins. The database is 8.4 GB, with 20k nodes and 70M relationships.
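For scale, here is a rough back-of-the-envelope check on the figures above. It is only a sketch: the node and relationship counts come from the post, while the projection assumes the share of pairs that fall under the threshold stays the same as the database grows, which nothing in the post confirms.

```python
def pair_count(n: int) -> int:
    """Number of unordered pairs among n nodes."""
    return n * (n - 1) // 2


current_nodes = 20_000      # from the post
current_rels = 70_000_000   # from the post
target_nodes = 320_000      # 20k existing + 300k waiting (the post says "more than")

# Share of all possible pairs that are currently stored as SIMILARITY relationships.
density = current_rels / pair_count(current_nodes)

# ASSUMPTION: the same share of pairs stays under the threshold at the target size.
projected_rels = density * pair_count(target_nodes)

print(f"possible pairs now:     {pair_count(current_nodes):,}")
print(f"share stored as rels:   {density:.0%}")
print(f"possible pairs at 320k: {pair_count(target_nodes):,}")
print(f"projected rels at 320k: {projected_rels:,.0f}")
```

About a third of all possible pairs are already materialized as relationships, and the number of pairs grows quadratically with the number of people, so the stored relationship count would grow far faster than the node count.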

My questions are:

Is there a way to speed this up?
Do you have any other design ideas for the nodes and relationships?

The query is:

CALL apoc.periodic.iterate(
    "Match (n:Person), (m: Person) 
           Where not (n)-[:SIMILARITY]-(m) and 
                        n.id_primary = $primary and 
                        n.id_primary <> m.id_primary 
     With n, m, gds.similarity.euclideanDistance(n.encoding, m.encoding) as distance 
           Where distance <= $threshold 
     Return n, m, distance", 
    "Merge (n)-[r:SIMILARITY {distance: distance}]-(m)", 
    {batchSize:100, iterateList:true, parallel:true, params:{primary:"x", threshold:0.45}}
)

The primary parameter is the person's unique identifier. The threshold parameter is always the same.

I've already created an index on id_primary.

To be honest, I'm well aware that this kind of query has to search the whole database to create the relationships. I'm just curious whether anyone has suggestions that address my questions.

freakmaxi · March 9, 2024, 7:44pm

We resolved the problem by giving up on creating relationships between the nodes.

We created a vector index on the encoding property and now run the search with the db.index.vector.queryNodes procedure, then recalculate the Euclidean distance for the highest-scoring matches. That is not exactly what we wanted to do with the previous approach, but it provides enough data in the query result to be usable.
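A minimal sketch of what this approach could look like, not the exact queries used here. The index name person_encoding, the $k parameter, and the helper within_threshold are hypothetical; the Cypher is held in plain Python strings so it can be passed to whichever driver session you use. The post does not say where the distance is recalculated, so doing it on the client is an assumption (it could equally be done in Cypher with gds.similarity.euclideanDistance).

```python
import math

# Hypothetical index name; 128 matches the encoding length given in the question.
CREATE_INDEX = """
CREATE VECTOR INDEX person_encoding IF NOT EXISTS
FOR (p:Person) ON (p.encoding)
OPTIONS {indexConfig: {
  `vector.dimensions`: 128,
  `vector.similarity_function`: 'euclidean'
}}
"""

# Fetch the $k nearest candidates for one person from the vector index.
# The person itself is normally among the results, so it is filtered out.
NEAREST = """
MATCH (n:Person {id_primary: $primary})
CALL db.index.vector.queryNodes('person_encoding', $k, n.encoding)
YIELD node AS m, score
WHERE m <> n
RETURN m.id_primary AS id, m.encoding AS encoding
"""


def within_threshold(query, candidates, threshold):
    """Recompute exact Euclidean distances for the candidates and keep
    those at or below the threshold, closest first.

    candidates is an iterable of (id, encoding) pairs, e.g. built from
    the rows returned by NEAREST.
    """
    scored = [(cid, math.dist(query, enc)) for cid, enc in candidates]
    kept = [(cid, d) for cid, d in scored if d <= threshold]
    return sorted(kept, key=lambda pair: pair[1])
```

The vector index performs an approximate nearest-neighbour search and returns only the top $k matches, so unlike the original query it does not guarantee every node under the threshold is found. That trade-off is what makes it fast: no SIMILARITY relationships are stored, and each lookup touches only a handful of candidates.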
