I would like to remove small networks (connected components) that have fewer than x nodes.
So, if a network component has x nodes or fewer, the nodes and the edges that belong to this component will be deleted.
Is that doable?
Hello @1113
Yeah, it looks possible. Can you show us a small example, with an image of what you want to keep and what you want to delete?
Regards,
Cobra
Hi,
Thank you for your reply
What I would like to remove are the nodes/edges in the red area.
Regards,
Can you execute CALL db.schema.visualization()
on your database and show us the result please?
I reduced the amount of data, so it will probably be clearer.
Below is an example of what I would like to do:
The nodes/edges I would like to keep are in the green circle.
I would like to remove the rest because those components have fewer than 6 nodes.
Attached is the result of CALL db.schema.visualization()
Schema.txt (1.1 KB)
And the CSV files that I used as the data source:
Edges.txt (1.3 KB) Nodes.txt (869 Bytes)
Thanks in advance !
For example, this query should delete connected components that have 10 nodes or fewer.
You will need the APOC plugin installed on the database.
// This version relies on an id property being set on every node
MATCH (a)-[*]-(b)
WITH id(a) AS id, apoc.coll.sortText(apoc.coll.toSet(collect(DISTINCT b.id) + [a.id])) AS nodes_list
WITH DISTINCT nodes_list, size(nodes_list) AS size
WHERE size <= 10
WITH nodes_list
CALL apoc.periodic.iterate('
MATCH (n)
WHERE n.id IN $nodes_list
RETURN n
', '
DETACH DELETE n
', {batchSize:1000, iterateList:true, params:{nodes_list:nodes_list}}) YIELD batch, operations
RETURN 1
Thank you for your reply.
That looks great. I managed to install APOC and run the query; 1 is returned, so the query seems to have executed successfully. But the nodes and edges are still there:
I feel like this voodoo spell needs to be optimized ;-)
Can you tell me the labels of your nodes and their properties?
Sure. Attached is my nodes file (no properties, just a label). Is that the cause of the issue?
Nodes.txt (869 Bytes)
Best regards
I created an id property in my examples; that's why I'm asking.
Try this one, it uses the Neo4j internal id:
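// For every node a, collect the internal ids of all nodes reachable from it.
// Nodes without any relationship never match this pattern.
MATCH (a)-[*]-(b)
WITH id(a) AS id, apoc.coll.sort(apoc.coll.toSet(collect(DISTINCT id(b)) + [id(a)])) AS nodes_list
// Nodes of the same component produce the same sorted list, so DISTINCT leaves one list per component
WITH DISTINCT nodes_list, size(nodes_list) AS size
// Keep only the components with 10 nodes or fewer
WHERE size <= 10
WITH nodes_list
// Delete the nodes of each small component, with their relationships, in batches
CALL apoc.periodic.iterate('
MATCH (n)
WHERE id(n) IN $nodes_list
RETURN n
', '
DETACH DELETE n
', {batchSize:1000, iterateList:true, params:{nodes_list:nodes_list}}) YIELD batch, operations
RETURN 1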
Regards,
Cobra
It works like a charm.
I will try to understand this query.
Thank you very much !
Happy to hear that.
Don't hesitate to ask if you have any trouble understanding my query.
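For anyone trying to follow the logic of the query, here is a rough Python sketch of the same idea on an in-memory edge list. It is only an illustration under assumptions: the function name small_components_per_node and the toy data are hypothetical, and a breadth-first search stands in for the (a)-[*]-(b) pattern. In Cypher that pattern enumerates paths rather than just reachable nodes, so the real query can do far more work than this sketch suggests.

```python
from collections import deque


def small_components_per_node(nodes, edges, max_size):
    """Mirror the query: one reachability search per node, then dedupe the lists."""
    # Undirected adjacency map, because the query's pattern ignores direction.
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    distinct_lists = set()
    for start in adj:
        if not adj[start]:
            continue  # a node without relationships never matches (a)-[*]-(b)
        # Collect every node reachable from `start` (the query's collect(id(b)) + [id(a)]).
        reached = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbour in adj[current]:
                if neighbour not in reached:
                    reached.add(neighbour)
                    queue.append(neighbour)
        # Sorting makes the list identical for every node of one component,
        # which is what lets DISTINCT collapse the duplicates.
        distinct_lists.add(tuple(sorted(reached)))

    # Same filter as WHERE size <= 10, with the threshold as a parameter.
    return sorted(c for c in distinct_lists if len(c) <= max_size)


# Toy data (hypothetical): components {1, 2, 3} and {4, 5}, plus isolated node 6.
small = small_components_per_node([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)], 2)
```

Note that the search is repeated from every node, so each component is rediscovered once per member before the duplicates are dropped.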
Hi,
Thank you for your feedback. I launched the query yesterday on a large database (5 million nodes and 10 million edges) and the process is still running. So I am not sure this approach will fit my needs.
What I do with Gephi: it can run statistics on a specific set of data. Running the statistics on network components gives you a component ID for each network. Then you can filter out the networks that are smaller than a certain size. The problem with Gephi is that it can't handle big data.
Would it be possible to do more or less the same thing with Neo4j: first obtain some stats on the data, then filter out the uninteresting data?
Another approach would be to obtain these stats and, rather than deleting the small networks, run a query that returns a list of specific nodes for all networks greater than a certain size.
I'm not sure I'm being very clear...
Based on your knowledge, what would be the best option to do that with a large database?
Best regards,
Hello @1113 ;)
First, did you use UNIQUE CONSTRAINTS to create your nodes?
Yeah, your approach should also be possible in Neo4j; I will try tomorrow.
Regards,
Cobra
Hi Cobra,
I didn't use unique constraints to create the nodes.
I will check the doc to determine how to do that.
Looking forward to getting your feedback.
Best regards,
@1113 ;-)
A UNIQUE CONSTRAINT should speed up the query; it's something you should have when you work with Neo4j.
Hi, I added the UNIQUE CONSTRAINT on all the entities (all are unique) and relaunched the query.
Let's see :-)
Have a great day !
Can you tell me which property is unique? That way we can use it in the query I gave you.
Thanks, you too!
Hi,
In fact, I use several entities: Item1, Item2, ... They don't have any properties except their label, and they are all unique.
Best regards,
OK, so we will have to create the communities and tag each node with its community in order to delete them by size, but I don't know if it will be faster. In your case, I think the problem is that you don't have a unique property; that's why everything takes time.
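To make the "tag each node with its community, then filter by size" idea concrete, here is a minimal Python sketch on an in-memory edge list. It is an assumption-laden illustration, not a Neo4j implementation: the function names tag_components and split_by_size and the toy data are hypothetical, and a union-find structure is swapped in as one common way to label connected components. Inside Neo4j, a weakly connected components algorithm such as the one in the Graph Data Science library would play the same role, if it is available.

```python
from collections import Counter


def tag_components(nodes, edges):
    """Return {node: component_id} using union-find on an undirected edge list."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent.setdefault(a, a)  # tolerate edges that mention unlisted nodes
        parent.setdefault(b, b)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    return {n: find(n) for n in list(parent)}


def split_by_size(component_of, max_size):
    """Split nodes into (keep, delete); components with <= max_size nodes are deleted."""
    sizes = Counter(component_of.values())
    keep = sorted(n for n, c in component_of.items() if sizes[c] > max_size)
    delete = sorted(n for n, c in component_of.items() if sizes[c] <= max_size)
    return keep, delete


# Toy data (hypothetical): a 6-node ring A..F, a pair G-H, and a lone node I.
nodes = list("ABCDEFGHI")
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F"), ("F", "A"), ("G", "H")]
keep, delete = split_by_size(tag_components(nodes, edges), max_size=5)
```

Each edge is visited once, so the work grows roughly with the number of edges instead of repeating a search from every node. The keep list also covers the alternative of listing the nodes of the larger networks without deleting anything.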