How to remove connected components with fewer than x nodes?

I would like to remove small networks (connected components) that have fewer than x nodes.
So, if a network component has x nodes or fewer, the nodes and edges that belong to it should be deleted.

Is that doable?

Hello @1113 :slight_smile:

Yeah, it looks possible. Can you show us a small example, with an image, of what you want to keep and what you want to delete?

Regards,
Cobra

Hi,

Thank you for your reply :slight_smile:
What I would like to remove are the nodes/edges in the red area.

Regards,

Can you execute CALL db.schema.visualization() on your database and show us the result please?

I reduced the amount of data so it will probably be clearer.
Below is an example of what I would like to do:
The green circle contains the nodes/edges I would like to keep.
I would like to remove the rest because those components are smaller than 6 nodes.

Attached is the result of CALL db.schema.visualization()
Schema.txt (1.1 KB)

And the CSV files that I used as the data source:
Edges.txt (1.3 KB) Nodes.txt (869 Bytes)

Thanks in advance!


For example, this query should delete connected components that have 10 nodes or fewer :slight_smile:
You will need the APOC plugin installed on the database.

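// Assumes every node has an `id` property.
// For each start node a, collect the ids of all nodes reachable from it (plus a itself).
MATCH (a)-[*]-(b)
WITH id(a) AS id, apoc.coll.sortText(apoc.coll.toSet(collect(DISTINCT b.id) + [a.id])) AS nodes_list
// Nodes of the same component yield the same sorted list, so DISTINCT keeps one list per component.
WITH DISTINCT nodes_list, size(nodes_list) AS size
WHERE size <= 10
WITH nodes_list
// Delete the nodes of each small component (and their relationships) in batches.
CALL apoc.periodic.iterate('
    MATCH (n)
    WHERE n.id IN $nodes_list
    RETURN n
    ', '
    DETACH DELETE n
    ', {batchSize:1000, iterateList:true, params:{nodes_list:nodes_list}}) YIELD batch, operations
RETURN 1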

Thank you for your reply.
That looks great. I managed to install APOC and run the query; 1 is returned, so the query seems to have executed successfully. But the nodes and edges are still there:

I feel like this voodoo spell needs to be optimized ;-)

Can you tell me the labels of your nodes and their properties?

Sure. Attached is my nodes file (no properties, just a label). Is that the cause of the issue?
Nodes.txt (869 Bytes)

Best regards

I created an id property in my examples, that's why I'm asking.
Try this one, it uses the Neo4j internal id:

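// Same approach as before, but based on the internal Neo4j id instead of an `id` property.
// For each start node a, collect the internal ids of all nodes reachable from it (plus a itself).
MATCH (a)-[*]-(b)
WITH id(a) AS id, apoc.coll.sort(apoc.coll.toSet(collect(DISTINCT id(b)) + [id(a)])) AS nodes_list
// Nodes of the same component yield the same sorted list, so DISTINCT keeps one list per component.
WITH DISTINCT nodes_list, size(nodes_list) AS size
WHERE size <= 10
WITH nodes_list
// Delete the nodes of each small component (and their relationships) in batches.
CALL apoc.periodic.iterate('
    MATCH (n)
    WHERE id(n) IN $nodes_list
    RETURN n
    ', '
    DETACH DELETE n
    ', {batchSize:1000, iterateList:true, params:{nodes_list:nodes_list}}) YIELD batch, operations
RETURN 1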
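
For readers following along, the logic of this query can be sketched outside Neo4j. The Python below is only an illustration of one reading of the query, not how Neo4j executes it: `small_components`, `edges`, `nodes` and `max_size` are hypothetical names, and the graph is a plain in-memory edge list. It assumes, as the `(a)-[*]-(b)` pattern implies, that nodes without any relationship are never matched and therefore never deleted.

```python
from collections import defaultdict


def small_components(edges, nodes, max_size):
    """Return the distinct components (sorted tuples of node ids) with at most max_size nodes."""
    # Undirected adjacency, because the pattern (a)-[*]-(b) ignores direction
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    def reachable(start):
        # Everything reachable from start, plus start itself
        seen = {start}
        stack = [start]
        while stack:
            cur = stack.pop()
            for nxt in adj[cur]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        # Sorting makes every node of a component produce the same tuple
        return tuple(sorted(seen))

    # The set plays the role of WITH DISTINCT; isolated nodes are skipped (assumption above)
    components = {reachable(n) for n in nodes if adj[n]}
    return sorted(c for c in components if len(c) <= max_size)


if __name__ == "__main__":
    edges = [(1, 2), (2, 3), (4, 5), (6, 7), (7, 8), (8, 9), (9, 10), (10, 11)]
    print(small_components(edges, range(1, 13), 3))
```

One difference worth keeping in mind: this sketch visits each node once per start node, whereas a variable-length pattern in Cypher enumerates paths, which can grow very quickly on dense or large graphs.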

Regards,
Cobra

It works like a charm.
I will try to understand this query.
Thank you very much! :slight_smile:

Happy to hear :slight_smile:

Don't hesitate to ask if you have any trouble understanding my query :slight_smile:

Hi,

Thank you for your feedback. I launched the query yesterday on a large database (5 million nodes and 10 million edges) and the process is still running. So I am not sure this approach will fit my needs.
What I do with Gephi: it is possible to run statistics on a specific set of data. By running the statistics based on network components, you obtain a component ID for each network. Then you can filter out the networks that are smaller than a certain size. The problem with Gephi is that it can't handle big data.
Would it be possible to do more or less the same thing with Neo4j: first obtain some statistics on the data, then filter out the uninteresting data?
Another approach would be to obtain these statistics and, rather than deleting small networks, run a query to obtain the list of specific nodes for all networks greater than a certain size.
I'm not sure I'm being very clear... :slight_smile:
Based on your knowledge, what would be the best option to do that with a large database?
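
To make the Gephi-style workflow concrete, here is a minimal sketch in Python, assuming the graph fits in memory as a node list and an edge list. It assigns a component ID to every node with a union-find structure, counts the size of each component, and returns the nodes that belong to components of at least a given size. The names `component_ids`, `nodes_in_large_components` and `min_size` are hypothetical and only serve the illustration. Within Neo4j, the Weakly Connected Components algorithm of the Graph Data Science library computes this kind of component ID, if that plugin is available.

```python
from collections import Counter


def component_ids(nodes, edges):
    """Map every node to a component ID (the root of its union-find tree)."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, shortening the path along the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # merge the two components

    return {n: find(n) for n in nodes}


def nodes_in_large_components(nodes, edges, min_size):
    """Return the nodes whose component has at least min_size nodes."""
    ids = component_ids(nodes, edges)
    sizes = Counter(ids.values())  # component ID -> number of nodes
    return sorted(n for n in nodes if sizes[ids[n]] >= min_size)


if __name__ == "__main__":
    nodes = list(range(1, 13))
    edges = [(1, 2), (2, 3), (4, 5), (6, 7), (7, 8), (8, 9), (9, 10), (10, 11)]
    print(nodes_in_large_components(nodes, edges, 6))
```

Each edge is processed once, so the cost grows roughly linearly with the number of edges, which is why this style of computation tends to suit large graphs better than comparing per-node lists of reachable nodes.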

Best regards,

Hello @1113 ;)

First, did you use UNIQUE CONSTRAINTS to create your nodes?

Yeah, your way should be also possible on Neo4j, I will try tomorrow :slight_smile:

Regards,
Cobra

Hi Cobra,

I didn't use unique constraints to create the nodes.
I will check the doc to determine how to do that.
Looking forward to getting your feedback :slight_smile:

Best regards,

@1113 ;-)

The UNIQUE CONSTRAINT should speed up the query; it's something you should have when you work with Neo4j :slight_smile:

Hi, I added a UNIQUE CONSTRAINT on all the entities (all are unique) and relaunched the query.
Let's see :-)
Have a great day!

Can you tell me which property is unique? That way we can use it in the query I gave you :slight_smile:
Thanks, you too!

Hi,

In fact, I use several entities: Item1, Item2, ... They don't have any properties except their label, and they are all unique.

Best regards,

Ok, so we will have to create a community for each size and tag each node with its community in order to delete them, but I don't know if it will be faster. In your case, the problem is that you don't have a unique property; I think that's why everything takes time :slight_smile:
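
As a rough illustration of the "tag, then delete" idea, the Python sketch below assumes each node has already been tagged with a community (component) label, given here as a plain dictionary. It counts the size of each community, selects the nodes of the small ones, and removes them together with their edges in batches, similar in spirit to the `batchSize` option used in the queries above. The function name, the `component_of` mapping and the return values are hypothetical; this is not a Neo4j API.

```python
from collections import Counter


def delete_small_components(component_of, edges, max_size, batch_size=1000):
    """Drop nodes whose community has at most max_size nodes, plus their edges.

    component_of: dict mapping node -> community tag (assumed precomputed).
    Returns (kept_nodes, kept_edges, number_of_batches).
    """
    sizes = Counter(component_of.values())  # community tag -> number of nodes
    doomed = [n for n, c in component_of.items() if sizes[c] <= max_size]

    removed = set()
    batches = 0
    # Process the deletions in fixed-size chunks, like a batched DETACH DELETE
    for start in range(0, len(doomed), batch_size):
        removed.update(doomed[start:start + batch_size])
        batches += 1

    kept_nodes = sorted(n for n in component_of if n not in removed)
    # An edge disappears as soon as one of its endpoints is deleted
    kept_edges = [(a, b) for a, b in edges if a not in removed and b not in removed]
    return kept_nodes, kept_edges, batches


if __name__ == "__main__":
    tags = {1: "a", 2: "a", 3: "a", 4: "b", 5: "b",
            6: "c", 7: "c", 8: "c", 9: "c", 10: "c", 11: "c"}
    edges = [(1, 2), (2, 3), (4, 5), (6, 7), (7, 8), (8, 9), (9, 10), (10, 11)]
    print(delete_small_components(tags, edges, max_size=5, batch_size=2))
```

With the tag stored on each node, selecting what to delete becomes a simple count per tag rather than a comparison of reachable-node lists, which is the main appeal of this approach.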