How to remove connected components with fewer than x nodes?

This query sets a community_id property on each node, where community_id is the size of the connected component the node belongs to:

MATCH (a)-[*]-(b)
WITH id(a) AS id, apoc.coll.sort(apoc.coll.toSet(collect(DISTINCT id(b)) + [id(a)])) AS nodes_list
WITH DISTINCT nodes_list, size(nodes_list) AS size
WITH size, apoc.coll.flatten(collect(nodes_list)) AS nodes_list
CALL apoc.periodic.iterate('
    MATCH (n)
    WHERE id(n) IN $nodes_list
    RETURN n
    ', '
    SET n.community_id = $community_id
    ', {batchSize:1000, params:{nodes_list:nodes_list, community_id:size}}) YIELD batch, operations
RETURN 1

Afterwards, if you want to delete the connected components that have fewer than 5 nodes:

CALL apoc.periodic.iterate('MATCH (n) WHERE n.community_id < 5 RETURN n', 'DETACH DELETE n', {batchSize:1000})

Regards,
Cobra
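
Before running the delete, it may be worth previewing what it would remove. The following is a minimal sketch that only assumes the community_id property written by the query above. Note that MATCH (a)-[*]-(b) needs at least one relationship, so isolated nodes never receive a community_id and would not be caught by the community_id < 5 filter; the second statement counts them separately.

```cypher
// Preview: how many nodes sit in components of each size below the threshold.
MATCH (n)
WHERE n.community_id < 5
RETURN n.community_id AS component_size, count(n) AS nodes
ORDER BY component_size;

// Isolated nodes (no relationships at all) have no community_id,
// so they have to be counted separately.
MATCH (n)
WHERE NOT (n)--()
RETURN count(n) AS isolated_nodes;
```

The two statements are independent and can be run one after the other.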

The query has been running for a few hours. I will keep you posted tomorrow ;-)

Thank you very much !

Hi Cobra,

The query is still running this morning. Is there a way to know what percentage of the job is done?

Have a great day !

Hi @1113 :slight_smile:

I'm confused because it should not take this long :confused:
Can you give me the configuration of your database?
How many nodes and relationships do you have?
Did you use the Hardware Sizing Calculator to size your database?

With this query, you can get the percentage:

MATCH (a) RETURN toFloat(count(a.community_id)) / toFloat(count(a)) * 100

Regards,
Cobra

Hi Cobra,

Attached is the DB configuration (I used the default config):
Neo4j-conf.txt (36.8 KB)

Regarding the number of nodes/relationships:
5,253,112 nodes (5 labels)
10,260,019 relationships (1 type)

I tried the Hardware Sizing Calculator and here's the result:

Recommended System Requirements:

|Number of Cores|1|
|Size on Disk|1.0 GB|

Summary:

|Number of nodes|5,000,000|
|Number of relationships|10,000,000|
|Properties per Node|1|
|Properties per Relationship|1|
|Estimated graph size on disk|1.0 GB|
|Concurrent requests per second|1|
|Average request time|1 ms|

The result of the query to obtain the percentage is: 0.0 (strange, isn't it?)

Best regards ! :-)
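
A result of 0.0 is what one would expect while the first part of the query is still running: collect() is an aggregation, so the MATCH (a)-[*]-(b) expansion has to finish completely before apoc.periodic.iterate is called and any community_id is written. The percentage therefore says nothing about how far the path expansion has progressed. As a rough alternative, assuming Neo4j 4.x (where the dbms.listQueries procedure is available; newer versions use SHOW TRANSACTIONS instead), the running query can at least be inspected:

```cypher
// List currently running queries, longest-running first.
// Assumes Neo4j 4.x; dbms.listQueries was later replaced by SHOW TRANSACTIONS.
CALL dbms.listQueries()
YIELD queryId, query, elapsedTimeMillis, status
RETURN queryId, status, elapsedTimeMillis / 1000 AS elapsed_seconds, query
ORDER BY elapsedTimeMillis DESC;
```

If the query needs to be stopped, CALL dbms.killQuery(queryId) is available in the same versions.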

I think you should increase the RAM available to your database :slight_smile:

I increased:

dbms.memory.heap.max_size=4G
and:
dbms.memory.pagecache.size=2G

Does that make sense?

Just relaunched the query with these new parameters
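
To confirm that the new values were picked up after the restart, something like the following could be run. This is a sketch assuming Neo4j 4.x, where the dbms.listConfig procedure exists; the search string simply filters on settings whose name contains dbms.memory.

```cypher
// Show the memory-related settings the running instance is actually using.
CALL dbms.listConfig('dbms.memory')
YIELD name, value
RETURN name, value
ORDER BY name;
```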

Let's try ! :slight_smile:

Ok, nice :slight_smile:

You should try adding a unique property: for example, create an id property equal to the Neo4j internal id and put a unique constraint on it. We could then use it in the query, which should be faster.

I'm not sure I understand. At the moment I have:
Entity:ID,description:LABEL
232ecace75a347258eb690c045322173,Item1
b4ca6276726c4bd9997ca2650b7177b0,Item2

Do you mean I should have something like:
Entity:ID,description:LABEL,Property:PROPERTY
232ecace75a347258eb690c045322173,Item1,UniqueString1
b4ca6276726c4bd9997ca2650b7177b0,Item2,UniqueString2

If so, can I use the concatenation of the ID and the label? I would then have:
Entity:ID,description:LABEL,Property:PROPERTY
232ecace75a347258eb690c045322173,Item1,232ecace75a347258eb690c045322173Item1
b4ca6276726c4bd9997ca2650b7177b0,Item2,b4ca6276726c4bd9997ca2650b7177b0Item2

Is Entity:ID unique for each node?

Can you execute CALL db.schema.visualization() on your database and show us the screenshot please?

Can you take a screenshot of your labels and properties on the left please.

I reimported the data, structured as:
Entity:ID,UniqEntity,description:LABEL
e53628fb3f714cbc9eb2546cecc7064c,e53628fb3f714cbc9eb2546cecc7064c,Item1
34c075e8781244bdb933c3539cdf167c,34c075e8781244bdb933c3539cdf167c,Item3

I added a constraint on UniqEntity:
CREATE CONSTRAINT ON(l:UniqEntity) ASSERT l.id IS UNIQUE
(I hope the syntax is OK; it seems to be.)

And here's the screenshot :

Have a nice evening :slight_smile:
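
For reference, in CREATE CONSTRAINT ON (l:UniqEntity) ASSERT l.id IS UNIQUE the name after the colon is read as a node label and id as a property name, so that statement constrains an id property on nodes labelled UniqEntity rather than the UniqEntity property itself. A sketch of what was probably intended, assuming the Neo4j 4.x syntax used in this thread and assuming the description:LABEL column produced labels such as Item1 and Item3 (one constraint is needed per label):

```cypher
// Uniqueness constraints are declared per label, on a property of that label.
// Item1 and Item3 are the labels from the sample rows; repeat for the others.
CREATE CONSTRAINT ON (n:Item1) ASSERT n.UniqEntity IS UNIQUE;
CREATE CONSTRAINT ON (n:Item3) ASSERT n.UniqEntity IS UNIQUE;

// List the constraints to verify that they exist (procedure form, Neo4j 3.x/4.x).
CALL db.constraints();
```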

Test this query:

MATCH (a)-[*]-(b)
WITH id(a) AS id, apoc.coll.sort(apoc.coll.toSet(collect(DISTINCT b.id)) + [a.id])) AS nodes_list
WITH DISTINCT nodes_list, size(nodes_list) AS size
WITH size, apoc.coll.flatten(collect(nodes_list)) AS nodes_list
CALL apoc.periodic.iterate('
    MATCH (n)
    WHERE n.id IN $nodes_list
    RETURN n
    ', '
    SET n.community_id = $community_id
    ', {batchSize:1000, params:{nodes_list:nodes_list, community_id:size}}) YIELD batch, operations
RETURN 1

Afterwards, if you want to delete the connected components that have fewer than 5 nodes:

CALL apoc.periodic.iterate('MATCH (n) WHERE n.community_id < 5 RETURN n', 'DETACH DELETE n', {batchSize:1000})

I feel like my new import is not correct because I get the following message:

Invalid input ')': expected whitespace, '.', node labels, '[', '^', '*', '/', '%', '+', '-', "=~", IN, STARTS, ENDS, CONTAINS, IS, '=', '~', "<>", "!=", '<', '>', "<=", ">=", AND, XOR, OR, AS, ',', ORDER, SKIP, LIMIT, WHERE, FROM GRAPH, USE GRAPH, CONSTRUCT, LOAD CSV, START, MATCH, UNWIND, MERGE, CREATE UNIQUE, CREATE, SET, DELETE, REMOVE, FOREACH, WITH, CALL, RETURN, UNION, ';' or end of input (line 2, column 83 (offset: 100))
"WITH id(a) AS id, apoc.coll.sort(apoc.coll.toSet(collect(DISTINCT b.id)) + [a.id])) AS nodes_list"

What is strange is that I don't have any special characters in the CSV file (except ',' as the separator).
So, I'm a bit confused :slight_smile:

Can you tell me what is the difference between your 3 properties (Entity, UniqEntity and id)?

There was a syntax error in my query (a misplaced closing parenthesis). Here is the corrected version:

MATCH (a)-[*]-(b)
WITH id(a) AS id, apoc.coll.sortText(apoc.coll.toSet(collect(DISTINCT b.id) + [a.id])) AS nodes_list
WITH DISTINCT nodes_list, size(nodes_list) AS size
WITH size, apoc.coll.flatten(collect(nodes_list)) AS nodes_list
CALL apoc.periodic.iterate('
    MATCH (n)
    WHERE n.id IN $nodes_list
    RETURN n
    ', '
    SET n.community_id = $community_id
    ', {batchSize:1000, params:{nodes_list:nodes_list, community_id:size}}) YIELD batch, operations
RETURN 1

Regards,
Cobra

Hi Cobra,

Thank you, I just relaunched the Query.

Regarding your question, based on my CSV file:

Entity:ID,UniqEntity,description:LABEL
e53628fb3fy14cbc9eb2546cecc70645,e53628fb3fy14cbc9eb2546cecc70645,Item1

  • Entity (the ID) is the unique identifier of the item.
  • The item type can be found in the "description" field: it can be Item1, Item2, Item3, Item4 or Item5.
  • UniqEntity is just a copy of the Entity value (the ID), so that there is a property that is unique. Is that the change you expected me to make to the file format?

Have a great day !

Why did you not use Entity:ID for the unique constraint instead of duplicating it?

Have a great day too!

Ah, sorry, I didn't understand correctly. So "UniqEntity" is useless; I'm going to remove it, put a constraint on the ID, and relaunch the query.

Can we keep the same query or should it be updated?

Best regards,

If the property name is id, you can use this one:

MATCH (a)-[*]-(b)
WITH id(a) AS id, apoc.coll.sortText(apoc.coll.toSet(collect(DISTINCT b.id) + [a.id])) AS nodes_list
WITH DISTINCT nodes_list, size(nodes_list) AS size
WITH size, apoc.coll.flatten(collect(nodes_list)) AS nodes_list
CALL apoc.periodic.iterate('
    MATCH (n)
    WHERE n.id IN $nodes_list
    RETURN n
    ', '
    SET n.community_id = $community_id
    ', {batchSize:1000, params:{nodes_list:nodes_list, community_id:size}}) YIELD batch, operations
RETURN 1

This requires more memory than the database size, because Cypher needs to keep all of these paths in memory, so it can trigger heavy garbage collection.

Have you tried the weakly connected components algo in GDS?

Maybe this can help identify the weakly connected components. Once you have run it, you can determine the smaller communities and delete them.
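
A minimal sketch of that approach, assuming the Graph Data Science library 2.x and APOC are installed (in GDS 1.x the projection procedure was named gds.graph.create instead of gds.graph.project). The graph name 'components', the property name componentId, and the threshold of 5 are illustrative choices, not something confirmed in the thread:

```cypher
// 1. Project every node and relationship into an in-memory graph.
CALL gds.graph.project('components', '*', '*')
YIELD graphName, nodeCount, relationshipCount;

// 2. Run weakly connected components and write the component id to each node.
CALL gds.wcc.write('components', {writeProperty: 'componentId'})
YIELD componentCount, nodePropertiesWritten;

// 3. Delete, in batches, every node whose component has fewer than 5 members.
CALL apoc.periodic.iterate(
  'MATCH (n)
   WITH n.componentId AS componentId, collect(n) AS members
   WHERE size(members) < 5
   UNWIND members AS n
   RETURN n',
  'DETACH DELETE n',
  {batchSize: 1000}
);

// 4. Release the in-memory projection.
CALL gds.graph.drop('components') YIELD graphName;
```

Unlike the variable-length path query, this also assigns a component to isolated nodes, so components of size 1 are removed as well. It also avoids enumerating every path between every pair of nodes, which is what makes the MATCH (a)-[*]-(b) approach so expensive.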


Oh, I forgot about this one :see_no_evil:

Yeah it could work :slight_smile: