How to remove connected components with fewer than x nodes?

1113 · August 27, 2020, 3:53pm

I would like to remove small networks (connected components) that have fewer than x nodes.
So, if a component has fewer than x nodes, the nodes and edges that belong to it will be deleted.

Is that doable?

Cobra · August 28, 2020, 9:13am

Hello @1113

Yeah, it looks possible. Can you show us a little example with an image with what you want to keep and what you want to delete?

Regards,
Cobra

1113 · August 28, 2020, 12:34pm

Hi,

Thank you for your reply.
What I would like to remove are the nodes/edges in the red area.

Regards,

Cobra · August 28, 2020, 12:50pm

Can you execute CALL db.schema.visualization() on your database and show us the result please?

1113 · August 28, 2020, 2:06pm

I reduced the amount of data so it will probably be clearer.
Below is an example of what I would like to do:
The nodes/edges I would like to keep are in the green circle.
I would like to remove the rest because those components have fewer than 6 nodes.

Attached is the result of CALL db.schema.visualization():
Schema.txt (1.1 KB)

And the CSV files that I used as the data source:
Edges.txt (1.3 KB) Nodes.txt (869 Bytes)

Thanks in advance!

Cobra · August 28, 2020, 3:20pm

This query should delete, for example, connected components that have 10 nodes or fewer.
You will need the APOC plugin installed on the database.

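// Assumes every node has an `id` property.
// For each node a, collect the ids of all nodes reachable from it, plus its own.
MATCH (a)-[*]-(b)
WITH id(a) AS id, apoc.coll.sortText(apoc.coll.toSet(collect(DISTINCT b.id) + [a.id])) AS nodes_list
// Nodes of the same component yield the same sorted list, so DISTINCT keeps one row per component.
WITH DISTINCT nodes_list, size(nodes_list) AS size
WHERE size <= 10
WITH nodes_list
// Delete the nodes (and their relationships) of each small component in batches.
CALL apoc.periodic.iterate('
    MATCH (n)
    WHERE n.id IN $nodes_list
    RETURN n
    ', '
    DETACH DELETE n
    ', {batchSize:1000, iterateList:true, params:{nodes_list:nodes_list}}) YIELD batch, operations
RETURN 1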

1113 · August 28, 2020, 9:00pm

Thank you for your reply.
That looks great. I managed to install APOC and ran the query; 1 was returned, so the query seems to have executed successfully. But the nodes and edges are still there:

I feel like this voodoo spell needs to be optimized ;-)

Cobra · August 28, 2020, 9:31pm

Can you tell me the labels of your nodes and their properties?

1113 · August 29, 2020, 5:01am

Sure. Attached is my nodes file (no properties, just a label). Is that the cause of the issue?
Nodes.txt (869 Bytes)

Best regards

Cobra · August 29, 2020, 8:25am

I created an id property in my examples, that's why I'm asking.
Try this one, it uses the Neo4j internal id:

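// For each node a, collect the internal ids of all nodes reachable from it, plus its own.
MATCH (a)-[*]-(b)
WITH id(a) AS id, apoc.coll.sort(apoc.coll.toSet(collect(DISTINCT id(b)) + [id(a)])) AS nodes_list
// Nodes of the same component yield the same sorted list, so DISTINCT keeps one row per component.
WITH DISTINCT nodes_list, size(nodes_list) AS size
WHERE size <= 10
WITH nodes_list
// Delete the nodes (and their relationships) of each small component in batches.
CALL apoc.periodic.iterate('
    MATCH (n)
    WHERE id(n) IN $nodes_list
    RETURN n
    ', '
    DETACH DELETE n
    ', {batchSize:1000, iterateList:true, params:{nodes_list:nodes_list}}) YIELD batch, operations
RETURN 1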

Regards,
Cobra

1113 · August 29, 2020, 2:03pm

It works like a charm.
I will try to understand this query.
Thank you very much!

Cobra · August 29, 2020, 2:26pm

Happy to hear that.

Don't hesitate to ask if you have any trouble understanding my query.

1113 · August 30, 2020, 8:33pm

Hi,

Thank you for your feedback. I launched the query yesterday on a large database (5 million nodes and 10 million edges) and the process is still running. So I am not sure this approach will fit my needs.
What I do with Gephi: it can run statistics on a specific set of data. Running the statistics on network components gives a component ID for each network. Then you can filter out the networks that are smaller than a certain size. The problem with Gephi is that it can't handle big data.
Would it be possible to do more or less the same thing with Neo4j: first obtain some stats on the data, and then filter out the uninteresting data?
Another approach would be to obtain these stats and, rather than deleting small networks, run a query to obtain the list of specific nodes for all networks greater than a certain size.
I'm not sure I'm being very clear...
Based on your knowledge, what would be the best option to do that with a large database?

Best regards,
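
The Gephi-style workflow described above (compute a component ID per node, then filter by component size) can be sketched outside Neo4j, for instance on the edge list before import. This is only a minimal Python sketch, assuming the nodes and edges fit in memory as plain names and pairs; the function names and the toy data are hypothetical and are not taken from the attached files.

```python
from collections import Counter


def component_ids(nodes, edges):
    """Return {node: component_id} using union-find (disjoint sets)."""
    parent = {n: n for n in nodes}

    def find(n):
        # Walk up to the root, halving the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in edges:
        # Endpoints missing from `nodes` are added on the fly.
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    return {n: find(n) for n in parent}


def split_by_component_size(nodes, edges, min_size):
    """Return (keep, drop): nodes in components of at least min_size nodes, and the rest."""
    comp = component_ids(nodes, edges)
    sizes = Counter(comp.values())  # component_id -> number of nodes
    keep = {n for n, c in comp.items() if sizes[c] >= min_size}
    drop = set(comp) - keep
    return keep, drop


# Hypothetical toy data: one 6-node chain, one 2-node pair, one isolated node.
nodes = ["a", "b", "c", "d", "e", "f", "x", "y", "z"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f"), ("x", "y")]
keep, drop = split_by_component_size(nodes, edges, min_size=6)
print(sorted(keep))
print(sorted(drop))
```

Each edge is processed once, so the work grows roughly linearly with the number of edges instead of enumerating every path between every pair of nodes. The `keep` set answers the "list the nodes of large networks" variant, and `drop` answers the "delete small networks" variant.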

Cobra · August 30, 2020, 8:55pm

Hello @1113 ;)

First, did you use UNIQUE CONSTRAINTS to create your nodes?

Yeah, your approach should also be possible in Neo4j. I will try it tomorrow.

Regards,
Cobra

1113 · August 30, 2020, 9:41pm

Hi Cobra,

I didn't use unique constraints to create the nodes.
I will check the docs to find out how to do that.
Looking forward to your feedback.

Best regards,

@1113 ;-)

Cobra · August 31, 2020, 5:55am

The UNIQUE CONSTRAINT should speed up the query. It's something you should have whenever you work with Neo4j.

1113 · August 31, 2020, 6:45am

Hi, I added the UNIQUE CONSTRAINT on all the entities (all are unique) and relaunched the query.
Let's see :-)
Have a great day!

Cobra · August 31, 2020, 6:51am

Can you tell me which property is unique? That way we can use it in the query I gave you.
Thanks, you too!

1113 · August 31, 2020, 1:58pm

Hi,

In fact I use several entities: Item1, Item2, ... They don't have any properties except their label, and they are all unique.

Best regards,

Cobra · August 31, 2020, 2:15pm

OK, so we will have to create a community for each size and tag each node with its community in order to delete them, but I don't know if it will be faster. In your case, I think the problem is that you don't have a unique property; that's why everything takes time.
