How to remove connected components less than x nodes?

Cobra · August 31, 2020, 2:40pm

This query sets a community_id property on each node, where community_id is the size of the connected component that contains the node:

MATCH (a)-[*]-(b)
WITH id(a) AS id, apoc.coll.sort(apoc.coll.toSet(collect(DISTINCT id(b)) + [id(a)])) AS nodes_list
WITH DISTINCT nodes_list, size(nodes_list) AS size
WITH size, apoc.coll.flatten(collect(nodes_list)) AS nodes_list
// The first statement selects the nodes; the second updates them in batches of 1000
CALL apoc.periodic.iterate('
    MATCH (n)
    WHERE id(n) IN $nodes_list
    RETURN n
    ', '
    SET n.community_id = $community_id
    ', {batchSize:1000, params:{nodes_list:nodes_list, community_id:size}}) YIELD batch, operations
RETURN 1

Afterwards, to delete the connected components that have fewer than 5 nodes:

CALL apoc.periodic.iterate('MATCH (n) WHERE n.community_id < 5 RETURN n', 'DETACH DELETE n', {batchSize:1000})

Regards,
Cobra

1113 · August 31, 2020, 9:00pm

The query has been running for a few hours. I will keep you posted tomorrow ;-)

Thank you very much!

1113 · September 1, 2020, 7:57am

Hi Cobra,

The query is still running this morning. Is there a way to know what percentage of the job is done?

Have a great day!

Cobra · September 1, 2020, 8:30am

Hi @1113

I'm confused, because it should not take this long.
Can you give me the configuration of your database?
How many nodes and relationships do you have?
Did you use the Hardware Sizing Calculator to size your database?

With this query, you can get the percentage:

MATCH (a) RETURN toFloat(count(a.community_id)) / toFloat(count(a)) * 100

Regards,
Cobra

1113 · September 1, 2020, 12:10pm

Hi Cobra,

Attached is the DB configuration (I used the default config):
Neo4j-conf.txt (36.8 KB)

Regarding the number of nodes and relationships:
5,253,112 nodes (5 labels)
10,260,019 relationships (1 type)

I tried the Hardware Sizing Calculator and here's the result:

Recommended System Requirements:
Number of Cores: 1
Size on Disk: 1.0 GB

Summary:
Number of nodes: 5,000,000
Number of relationships: 10,000,000
Properties per Node: 1
Properties per Relationship: 1
Estimated graph size on disk: 1.0 GB
Concurrent requests per second: 1
Average request time: 1 ms

The query to get the percentage returns 0.0 (strange, isn't it?)

Best regards! :-)

Cobra · September 1, 2020, 12:22pm

I think you should increase the memory allocated to your database.

1113 · September 1, 2020, 1:03pm

I increased:

dbms.memory.heap.max_size=4G
and:
dbms.memory.pagecache.size=2G

Does that make sense?

I just relaunched the query with these new parameters.

Let's try!

Cobra · September 1, 2020, 1:19pm

Ok, nice

You should try adding a unique property: for example, create an id property equal to the Neo4j id and put a unique constraint on it. Afterwards we can use it in the query, and it should be faster.

1113 · September 1, 2020, 1:51pm

I'm not sure I understand. At the moment I have:
Entity:ID,description:LABEL
232ecace75a347258eb690c045322173,Item1
b4ca6276726c4bd9997ca2650b7177b0,Item2

Do you mean I should have something like:
Entity:ID,description:LABEL,Property:PROPERTY
232ecace75a347258eb690c045322173,Item1,UniqueString1
b4ca6276726c4bd9997ca2650b7177b0,Item2,UniqueString2

If so, can I use the concatenation of the ID and the label? Then I would have:
Entity:ID,description:LABEL,Property:PROPERTY
232ecace75a347258eb690c045322173,Item1,232ecace75a347258eb690c045322173Item1
b4ca6276726c4bd9997ca2650b7177b0,Item2,b4ca6276726c4bd9997ca2650b7177b0Item2

Cobra · September 1, 2020, 1:53pm

Is Entity:ID unique for each node?

Can you execute CALL db.schema.visualization() on your database and show us the screenshot please?

Can you also take a screenshot of your labels and properties on the left, please?

1113 · September 1, 2020, 7:27pm

I reimported the data, structured as:
Entity:ID,UniqEntity,description:LABEL
e53628fb3f714cbc9eb2546cecc7064c,e53628fb3f714cbc9eb2546cecc7064c,Item1
34c075e8781244bdb933c3539cdf167c,34c075e8781244bdb933c3539cdf167c,Item3

I added a constraint on UniqEntity:
CREATE CONSTRAINT ON(l:UniqEntity) ASSERT l.id IS UNIQUE
(I hope the syntax is OK; it seems to be.)

And here's the screenshot:

Have a nice evening

Cobra · September 1, 2020, 7:30pm

Test this query:

MATCH (a)-[*]-(b)
WITH id(a) AS id, apoc.coll.sort(apoc.coll.toSet(collect(DISTINCT b.id)) + [a.id])) AS nodes_list
WITH DISTINCT nodes_list, size(nodes_list) AS size
WITH size, apoc.coll.flatten(collect(nodes_list)) AS nodes_list
// The first statement selects the nodes; the second updates them in batches of 1000
CALL apoc.periodic.iterate('
    MATCH (n)
    WHERE n.id IN $nodes_list
    RETURN n
    ', '
    SET n.community_id = $community_id
    ', {batchSize:1000, params:{nodes_list:nodes_list, community_id:size}}) YIELD batch, operations
RETURN 1

Afterwards, to delete the connected components that have fewer than 5 nodes:

CALL apoc.periodic.iterate('MATCH (n) WHERE n.community_id < 5 RETURN n', 'DETACH DELETE n', {batchSize:1000})

1113 · September 1, 2020, 10:12pm

I feel like my new import is not correct, because I get the following message:

Invalid input ')': expected whitespace, '.', node labels, '[', '^', '*', '/', '%', '+', '-', "=~", IN, STARTS, ENDS, CONTAINS, IS, '=', '~', "<>", "!=", '<', '>', "<=", ">=", AND, XOR, OR, AS, ',', ORDER, SKIP, LIMIT, WHERE, FROM GRAPH, USE GRAPH, CONSTRUCT, LOAD CSV, START, MATCH, UNWIND, MERGE, CREATE UNIQUE, CREATE, SET, DELETE, REMOVE, FOREACH, WITH, CALL, RETURN, UNION, ';' or end of input (line 2, column 83 (offset: 100))
"WITH id(a) AS id, apoc.coll.sort(apoc.coll.toSet(collect(DISTINCT b.id)) + [a.id])) AS nodes_list"

What is strange is that I don't have any special characters in the CSV file (except ',' as the separator).
So I'm a bit confused.

Cobra · September 2, 2020, 8:59am

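Can you tell me what the difference is between your 3 properties (Entity, UniqEntity and id)?

There was a syntax error in my query (an extra closing parenthesis). Here is the corrected version, which also uses apoc.coll.sortText since the ids are strings: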

MATCH (a)-[*]-(b)
WITH id(a) AS id, apoc.coll.sortText(apoc.coll.toSet(collect(DISTINCT b.id) + [a.id])) AS nodes_list
WITH DISTINCT nodes_list, size(nodes_list) AS size
WITH size, apoc.coll.flatten(collect(nodes_list)) AS nodes_list
// The first statement selects the nodes; the second updates them in batches of 1000
CALL apoc.periodic.iterate('
    MATCH (n)
    WHERE n.id IN $nodes_list
    RETURN n
    ', '
    SET n.community_id = $community_id
    ', {batchSize:1000, params:{nodes_list:nodes_list, community_id:size}}) YIELD batch, operations
RETURN 1

Regards,
Cobra

1113 · September 2, 2020, 9:17am

Hi Cobra,

Thank you, I just relaunched the query.

Regarding your question, based on my CSV file:

Entity:ID,UniqEntity,description:LABEL
e53628fb3fy14cbc9eb2546cecc70645,e53628fb3fy14cbc9eb2546cecc70645,Item1

Entity (the ID) is the unique identifier of the item.
The item type is in the "description" field: it can be Item1, Item2, Item3, Item4 or Item5.
UniqEntity is just a copy of the Entity value (the ID), only there to have a property that is unique. Is that the change to the file format you expected from me?

Have a great day!

Cobra · September 2, 2020, 9:27am

Why did you not use Entity:ID for the unique constraint instead of duplicating it?

Have a great day too!

1113 · September 2, 2020, 10:24am

Ah, sorry, I didn't understand correctly. So "UniqEntity" is useless; I'm going to remove it, put a constraint on ID, and relaunch the query.

Can we keep the same query, or should it be updated?

Best regards,

Cobra · September 2, 2020, 10:29am

If the property name is id, you can use this one:

MATCH (a)-[*]-(b)
WITH id(a) AS id, apoc.coll.sortText(apoc.coll.toSet(collect(DISTINCT b.id) + [a.id])) AS nodes_list
WITH DISTINCT nodes_list, size(nodes_list) AS size
WITH size, apoc.coll.flatten(collect(nodes_list)) AS nodes_list
// The first statement selects the nodes; the second updates them in batches of 1000
CALL apoc.periodic.iterate('
    MATCH (n)
    WHERE n.id IN $nodes_list
    RETURN n
    ', '
    SET n.community_id = $community_id
    ', {batchSize:1000, params:{nodes_list:nodes_list, community_id:size}}) YIELD batch, operations
RETURN 1

anthapu · September 2, 2020, 10:54am

This requires more memory than the size of the database, because Cypher needs to keep all of the paths matched by (a)-[*]-(b) in memory. That can trigger heavy garbage collection.
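To make the memory point concrete, here is a small Python sketch. It is not Neo4j code, and the helper names `count_trails` and `complete_graph` are made up for illustration. It counts the paths that never reuse a relationship, which is what an unbounded pattern like (a)-[*]-(b) has to enumerate, on tiny fully connected graphs.

```python
from collections import defaultdict
from itertools import combinations


def complete_graph(n):
    """Return the edge list of a fully connected undirected graph on n nodes."""
    return list(combinations(range(n), 2))


def count_trails(edges):
    """Count non-empty paths that never reuse an edge, summed over all start nodes."""
    adjacency = defaultdict(list)
    for edge_id, (u, v) in enumerate(edges):
        adjacency[u].append((edge_id, v))
        adjacency[v].append((edge_id, u))

    def extend(node, used):
        # Try every unused edge leaving `node`; each one is a new path,
        # plus all the longer paths that continue from its far end.
        total = 0
        for edge_id, neighbour in adjacency[node]:
            if edge_id not in used:
                used.add(edge_id)
                total += 1 + extend(neighbour, used)
                used.remove(edge_id)
        return total

    return sum(extend(start, set()) for start in list(adjacency))


if __name__ == "__main__":
    for n in range(3, 6):
        print(n, "nodes:", count_trails(complete_graph(n)), "paths")
```

A triangle (3 nodes, 3 relationships) already yields 18 such paths, and the count climbs much faster than the node count as the graph gets denser. With millions of nodes and relationships, that is plausibly why the query never reached the SET step.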

Have you tried the weakly connected components algo in GDS?

Maybe this can help identify the weakly connected components. Once you have run it, you can determine the smaller communities and delete them.
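As a rough illustration of what a weakly connected components pass computes, here is a plain Python sketch using union-find. It is not the GDS implementation or its API; `small_component_nodes` and its parameters are hypothetical names. It reads each relationship once, groups nodes into components, and returns the nodes whose component has fewer than `min_size` nodes.

```python
from collections import Counter


def find(parent, x):
    """Return the representative of x's component, shortening the chain as it goes."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def small_component_nodes(nodes, edges, min_size):
    """Return the set of nodes that sit in a component with fewer than min_size nodes."""
    nodes = list(nodes)
    parent = {n: n for n in nodes}

    # One pass over the relationships; direction is ignored (weak connectivity).
    for u, v in edges:
        root_u, root_v = find(parent, u), find(parent, v)
        if root_u != root_v:
            parent[root_u] = root_v

    # Component size = number of nodes sharing the same representative.
    sizes = Counter(find(parent, n) for n in nodes)
    return {n for n in nodes if sizes[find(parent, n)] < min_size}


if __name__ == "__main__":
    example_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (5, 6)]
    print(small_component_nodes(range(8), example_edges, min_size=5))
```

In the example, nodes 0 to 4 form a component of size 5 and are kept, while nodes 5, 6 and the isolated node 7 are returned as candidates for deletion. Unlike the path-based query, this needs memory proportional to the number of nodes, not the number of paths.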

Cobra · September 2, 2020, 11:09am

Oh, I forgot about that one.

Yeah, it could work.
