How to remove connected components less than x nodes?

1113
Node Clone

I would like to remove small networks (connected components) that have less than x nodes.
So, if the network component has x nodes or less, the nodes and the edges that belong to this component will be deleted.

Is that doable ?

1 ACCEPTED SOLUTION

Cobra
Ninja
Ninja

You didn't replace the id property with EntityID:

CALL gds.wcc.stream({
    nodeProjection: "Entity",
    relationshipProjection: "IRW"
})
YIELD nodeId, componentId
WITH componentId, collect(gds.util.asNode(nodeId).EntityID) AS libraries
WITH componentId, libraries, apoc.create.uuid() AS uuid
CALL apoc.periodic.iterate('
  MATCH (n)
  WHERE n.EntityID IN $nodes_list
  RETURN n
  ', '
  SET n.uuid = $uuid
  ', {batchSize:1000, params:{nodes_list:libraries, uuid:uuid}}) YIELD batch, operations
RETURN 1


73 REPLIES 73

Cobra
Ninja
Ninja

Hello @1113

Yeah, it looks possible. Can you show us a little example with an image with what you want to keep and what you want to delete?

Regards,
Cobra

1113
Node Clone

Hi,

Thank you for your reply
What I would like to remove are the nodes/edges in the red area.

Regards,

Cobra
Ninja
Ninja

Can you execute CALL db.schema.visualization() on your database and show us the result please?

I reduced the amount of data so it should be clearer.
Below is an example of what I would like to do:
The nodes and edges I would like to keep are in the green circle.
I would like to remove the rest because those components have fewer than 6 nodes.

Attached is the result of CALL db.schema.visualization()
Schema.txt (1.1 KB)

And the CSV files that I used as the data source:
Edges.txt (1.3 KB) Nodes.txt (869 Bytes)

Thanks in advance !

Cobra
Ninja
Ninja

For example, this query should delete connected components that have 10 nodes or fewer.
You will need the APOC plugin installed on the database.

MATCH (a)-[*]-(b)
WITH id(a) AS id, apoc.coll.sortText(apoc.coll.toSet(collect(DISTINCT b.id) + [a.id])) AS nodes_list
WITH DISTINCT nodes_list, size(nodes_list) AS size
WHERE size <= 10
WITH nodes_list
CALL apoc.periodic.iterate('
    MATCH (n)
    WHERE n.id IN $nodes_list
    RETURN n
    ', '
    DETACH DELETE n
    ', {batchSize:1000, iterateList:true, params:{nodes_list:nodes_list}}) YIELD batch, operations
RETURN 1

1113
Node Clone

Thank you for your reply.
That looks great. I managed to install APOC and run the query. It returns 1, so the query seems to have executed successfully, but the nodes and edges are still there:

I feel like this voodoo spell needs to be optimized 😉

Cobra
Ninja
Ninja

Can you tell me the labels of your nodes and their properties?

1113
Node Clone

Sure. Attached my nodes file (no properties, just a label). Is that the cause of the issue ?
Nodes.txt (869 Bytes)

Best regards

Cobra
Ninja
Ninja

I created an id property in my examples, which is why I'm asking.
Try this one, which uses the internal Neo4j id:

MATCH (a)-[*]-(b)
WITH id(a) AS id, apoc.coll.sort(apoc.coll.toSet(collect(DISTINCT id(b)) + [id(a)])) AS nodes_list
WITH DISTINCT nodes_list, size(nodes_list) AS size
WHERE size <= 10
WITH nodes_list
CALL apoc.periodic.iterate('
    MATCH (n)
    WHERE id(n) IN $nodes_list
    RETURN n
    ', '
    DETACH DELETE n
    ', {batchSize:1000, iterateList:true, params:{nodes_list:nodes_list}}) YIELD batch, operations
RETURN 1

Regards,
Cobra

1113
Node Clone

It works like a charm.
I will try to understand this query.
Thank you very much !

Cobra
Ninja
Ninja

Happy to hear that.

Don't hesitate to ask if you have any trouble understanding my query.

1113
Node Clone

Hi,

Thank you for your feedback. I launched the query yesterday on a large database (5 million nodes and 10 million edges) and the process is still running, so I am not sure this approach will fit my needs.
What I do with Gephi: it can run statistics on a specific set of data. Running the statistics on network components gives you a component ID for each network, and you can then filter out the networks that are smaller than a certain size. The problem with Gephi is that it can't handle big data.
Would it be possible to do more or less the same thing with Neo4j: first obtain some statistics on the data, then filter out the uninteresting data?
Another approach would be to obtain these statistics and, rather than deleting small networks, write a query that lists specific nodes for all networks larger than a certain size.
I'm not sure I'm being very clear...
Based on your knowledge, what would be the best option for doing that with a large database?

Best regards,

Cobra
Ninja
Ninja

Hello @1113 😉

First, did you use UNIQUE CONSTRAINTS when creating your nodes?

Yes, your approach should also be possible in Neo4j. I will try tomorrow.

Regards,
Cobra

1113
Node Clone

Hi Cobra,

I didn't use unique constraints to create the nodes.
I will check the doc to determine how to do that.
Looking forward to your feedback.

Best regards,

@1113 😉

Cobra
Ninja
Ninja

The UNIQUE CONSTRAINT should speed up the query. It's good practice to have one when you work with Neo4j.
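
For reference, here is a minimal sketch of what such a constraint could look like in the Neo4j 4.x syntax. The label Item, the property id, and the parameter $ids are placeholders, not names confirmed in this thread. A unique constraint is backed by an index, but the planner can only use that index when the MATCH pattern names the label, so the lookup in the sketch includes it.

```cypher
// Placeholder names: replace Item and id with your own label and property.
CREATE CONSTRAINT ON (n:Item) ASSERT n.id IS UNIQUE;

// The index behind the constraint is only usable when the label is given.
// A pattern without a label, such as MATCH (n) WHERE n.id IN $ids,
// has to scan every node in the database.
MATCH (n:Item)
WHERE n.id IN $ids
RETURN n;
```

Newer Neo4j versions write the same constraint as CREATE CONSTRAINT FOR (n:Item) REQUIRE n.id IS UNIQUE.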

1113
Node Clone

Hi, I added the UNIQUE CONSTRAINT on all the entities (all are unique) and relaunched the query.
Let's see 🙂
Have a great day !

Cobra
Ninja
Ninja

Can you tell me which property is unique? That way we can use it in the query I gave you.
Thanks, you too!

Hi,

In fact I use several entities: Item1, Item2, ... They don't have any properties except their label, and they are all unique.

Best regards,

Cobra
Ninja
Ninja

Ok, so we will have to create a community for each size and tag each node with its community in order to delete them, but I don't know if it will be faster. In your case, the problem is that you don't have a unique property. I think that's why everything takes so long.

Cobra
Ninja
Ninja

This query will set a community_id property for each node where the community_id is the size of the network where the node is:

MATCH (a)-[*]-(b)
WITH id(a) AS id, apoc.coll.sort(apoc.coll.toSet(collect(DISTINCT id(b)) + [id(a)])) AS nodes_list
WITH DISTINCT nodes_list, size(nodes_list) AS size
WITH size, apoc.coll.flatten(collect(nodes_list)) AS nodes_list
CALL apoc.periodic.iterate('
    MATCH (n)
    WHERE id(n) IN $nodes_list
    RETURN n
    ', '
    SET n.community_id = $community_id
    ', {batchSize:1000, params:{nodes_list:nodes_list, community_id:size}}) YIELD batch, operations
RETURN 1

Afterwards, if you want to delete the connected components that have fewer than 5 nodes:

CALL apoc.periodic.iterate('MATCH (n) WHERE n.community_id < 5 RETURN n', 'DETACH DELETE n', {batchSize:1000})

Regards,
Cobra

The query has been running for a few hours. I will keep you posted tomorrow 😉

Thank you very much !

1113
Node Clone

Hi Cobra,

The query is still running this morning. Is there a way to know what percentage of the job is done?

Have a great day !

Cobra
Ninja
Ninja

Hi @1113

I'm confused because it should not take so long.
Can you give me the configuration of your database?
How many nodes and relationships do you have?
Did you use the Hardware Sizing Calculator to size your database?

With this query, you can get the percentage:

MATCH (a) RETURN toFloat(count(a.community_id)) / toFloat(count(a)) * 100

Regards,
Cobra

1113
Node Clone

Hi Cobra,

Attached is the DB configuration (I used the default config)
Neo4j-conf.txt (36.8 KB)

Regarding the number of nodes/relationships :
5,253,112 nodes (5 labels)
10,260,019 relationships (1 type)

I tried Hardware Sizing Calculator and here's the result :
Recommended System Requirements:

|Number of Cores|1|
|Size on Disk|1.0 GB|

Summary
Number of nodes 5,000,000
Number of relationships 10,000,000
Properties per Node 1
Properties per Relationship 1
Estimated graph size on disk 1.0 GB
Concurrent requests per second 1
Average request time 1 ms

The result of the query to obtain the percentage is 0.0 (strange, isn't it?)

Best regards ! 🙂

Cobra
Ninja
Ninja

I think you should increase the memory available to your database.

1113
Node Clone

I increased :

dbms.memory.heap.max_size=4G
and :
dbms.memory.pagecache.size=2G

Makes sense ?

Just relaunched the query with these new parameters

Let's try !

Cobra
Ninja
Ninja

Ok, nice

You should try to add a unique property: for example, create an id property equal to the Neo4j id and put a unique constraint on it. Then we could use it in the query, and it should be faster.

I'm not sure I understand. At the moment I have:
Entity:ID,description:LABEL
232ecace75a347258eb690c045322173,Item1
b4ca6276726c4bd9997ca2650b7177b0,Item2

Do you mean I should have something like:
Entity:ID,description:LABEL,Property:PROPERTY
232ecace75a347258eb690c045322173,Item1,UniqueString1
b4ca6276726c4bd9997ca2650b7177b0,Item2,UniqueString2

If so, can I use the concatenation of the ID and the label? Then I would have:
Entity:ID,description:LABEL,Property:PROPERTY
232ecace75a347258eb690c045322173,Item1,232ecace75a347258eb690c045322173Item1
b4ca6276726c4bd9997ca2650b7177b0,Item2,b4ca6276726c4bd9997ca2650b7177b0Item2

Cobra
Ninja
Ninja

Is Entity:ID unique for each node?

Can you execute CALL db.schema.visualization() on your database and show us the screenshot please?

Can you take a screenshot of your labels and properties on the left please.

1113
Node Clone

I reimported the data structured as :
Entity:ID,UniqEntity,description:LABEL
e53628fb3f714cbc9eb2546cecc7064c,e53628fb3f714cbc9eb2546cecc7064c,Item1
34c075e8781244bdb933c3539cdf167c,34c075e8781244bdb933c3539cdf167c,Item3

I added a constraint on UniqEntity:
CREATE CONSTRAINT ON(l:UniqEntity) ASSERT l.id IS UNIQUE
(I hope the syntax is OK; it seems to be.)

And here's the screenshot :

Have a nice evening

Cobra
Ninja
Ninja

Test this query:

MATCH (a)-[*]-(b)
WITH id(a) AS id, apoc.coll.sort(apoc.coll.toSet(collect(DISTINCT b.id)) + [a.id])) AS nodes_list
WITH DISTINCT nodes_list, size(nodes_list) AS size
WITH size, apoc.coll.flatten(collect(nodes_list)) AS nodes_list
CALL apoc.periodic.iterate('
    MATCH (n)
    WHERE n.id IN $nodes_list
    RETURN n
    ', '
    SET n.community_id = $community_id
    ', {batchSize:1000, params:{nodes_list:nodes_list, community_id:size}}) YIELD batch, operations
RETURN 1

Afterwards, if you want to delete the connected components that have fewer than 5 nodes:

CALL apoc.periodic.iterate('MATCH (n) WHERE n.community_id < 5 RETURN n', 'DETACH DELETE n', {batchSize:1000})

1113
Node Clone

I feel like my new import is not correct because I get the following message:

Invalid input ')': expected whitespace, '.', node labels, '[', '^', '*', '/', '%', '+', '-', "=~", IN, STARTS, ENDS, CONTAINS, IS, '=', '~', "<>", "!=", '<', '>', "<=", ">=", AND, XOR, OR, AS, ',', ORDER, SKIP, LIMIT, WHERE, FROM GRAPH, USE GRAPH, CONSTRUCT, LOAD CSV, START, MATCH, UNWIND, MERGE, CREATE UNIQUE, CREATE, SET, DELETE, REMOVE, FOREACH, WITH, CALL, RETURN, UNION, ';' or end of input (line 2, column 83 (offset: 100))
"WITH id(a) AS id, apoc.coll.sort(apoc.coll.toSet(collect(DISTINCT b.id)) + [a.id])) AS nodes_list"

What is strange is that I don't have any special characters in the CSV file (except ',' as the separator).
So I'm a bit confused.

Cobra
Ninja
Ninja

Can you tell me what the difference is between your 3 properties (Entity, UniqEntity and id)?

There was a syntax error in my query. Here is the corrected version:

MATCH (a)-[*]-(b)
WITH id(a) AS id, apoc.coll.sortText(apoc.coll.toSet(collect(DISTINCT b.id) + [a.id])) AS nodes_list
WITH DISTINCT nodes_list, size(nodes_list) AS size
WITH size, apoc.coll.flatten(collect(nodes_list)) AS nodes_list
CALL apoc.periodic.iterate('
    MATCH (n)
    WHERE n.id IN $nodes_list
    RETURN n
    ', '
    SET n.community_id = $community_id
    ', {batchSize:1000, params:{nodes_list:nodes_list, community_id:size}}) YIELD batch, operations
RETURN 1

Regards,
Cobra

Hi Cobra,

Thank you, I just relaunched the Query.

Regarding your question, based on my CSV file:

Entity:ID,UniqEntity,description:LABEL
e53628fb3fy14cbc9eb2546cecc70645,e53628fb3fy14cbc9eb2546cecc70645,Item1

  • Entity (the ID) is the unique identifier for the Item type.
  • The Item Type can be found in the "description" field : It can be Item1, Item2, Item3, Item4 or Item5
  • UniqEntity is just a copy of the Entity value (ID), just to have a property that is unique. Is that what you expected me to change in the file format?

Have a great day !

Cobra
Ninja
Ninja

Why didn't you use Entity:ID for the unique constraint instead of duplicating it?

Have a great day too!

Ah, sorry, I didn't understand correctly. So "UniqEntity" is useless. I'm going to remove it, put a constraint on ID, and relaunch the query.

Can we keep the same query, or should it be updated?

Best regards,

Cobra
Ninja
Ninja

If the property name is id, you can use this one:

MATCH (a)-[*]-(b)
WITH id(a) AS id, apoc.coll.sortText(apoc.coll.toSet(collect(DISTINCT b.id) + [a.id])) AS nodes_list
WITH DISTINCT nodes_list, size(nodes_list) AS size
WITH size, apoc.coll.flatten(collect(nodes_list)) AS nodes_list
CALL apoc.periodic.iterate('
    MATCH (n)
    WHERE n.id IN $nodes_list
    RETURN n
    ', '
    SET n.community_id = $community_id
    ', {batchSize:1000, params:{nodes_list:nodes_list, community_id:size}}) YIELD batch, operations
RETURN 1

This requires more memory than the database size, because Cypher needs to keep all these paths in memory. That can trigger heavy garbage collection.

Have you tried the weakly connected components (WCC) algorithm in GDS?

https://neo4j.com/docs/graph-data-science/current/algorithms/wcc/

This may help identify the weakly connected components. Once you have run it, you can determine the smaller communities and delete them.
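
A rough sketch of how this could look, assuming a GDS 1.x version that accepts anonymous projections and an APOC installation. The label Item, the relationship type BELONGS_TO, the property name componentId, and the threshold of 6 are all placeholders to adapt. The first statement writes the component id onto every node. The second groups nodes by that id and deletes, in batches, the members of every component below the threshold.

```cypher
// Step 1: store the WCC component id on each node.
// Item, BELONGS_TO and componentId are placeholder names.
CALL gds.wcc.write({
    nodeProjection: "Item",
    relationshipProjection: "BELONGS_TO",
    writeProperty: "componentId"
})
YIELD nodePropertiesWritten, componentCount
RETURN nodePropertiesWritten, componentCount;

// Step 2 (run separately): delete every component with fewer than $minSize nodes.
// The first statement selects the nodes, the second deletes them in batches of 1000.
CALL apoc.periodic.iterate(
    'MATCH (n:Item)
     WITH n.componentId AS componentId, collect(n) AS members
     WHERE size(members) < $minSize
     UNWIND members AS n
     RETURN n',
    'DETACH DELETE n',
    {batchSize:1000, params:{minSize:6}})
YIELD batches, total, errorMessages
RETURN batches, total, errorMessages;
```

Because the component id stays on the nodes, the same property can also be used to list the members of the large components instead of deleting the small ones.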

Cobra
Ninja
Ninja

Oh, I forgot about this one.

Yeah, it could work.

1113
Node Clone

Thank you for the suggestion, I'm going to try that.
Best regards !

1113
Node Clone

I found this:

CALL gds.wcc.stream({
nodeProjection: "Library",
relationshipProjection: "DEPENDS_ON"
})
YIELD nodeId, componentId
RETURN componentId, collect(gds.util.asNode(nodeId).id) AS libraries
ORDER BY size(libraries) DESC;

Would that be a good starting point?

Cobra
Ninja
Ninja

Yeah, good start!

If you want to do everything in one step (you may have to change the nodeProjection and the relationshipProjection), this query deletes communities that have fewer than 6 nodes:

CALL gds.wcc.stream({
    nodeProjection: "Item",
    relationshipProjection: "BELONGS_TO"
})
YIELD nodeId, componentId
WITH componentId, collect(gds.util.asNode(nodeId).id) AS libraries
WITH size(libraries) AS size, libraries
WHERE size < 6
WITH apoc.coll.flatten(collect(libraries)) AS nodes_list
CALL apoc.periodic.iterate('
    MATCH (n)
    WHERE n.id IN $nodes_list
    RETURN n
    ', '
    DETACH DELETE n
    ', {batchSize:1000, params:{nodes_list:nodes_list}}) YIELD batch, operations
RETURN 1

If you want to do it in two steps:

  • Save the size of the community in a property:
CALL gds.wcc.stream({
    nodeProjection: "Item",
    relationshipProjection: "BELONGS_TO"
})
YIELD nodeId, componentId
WITH componentId, collect(gds.util.asNode(nodeId).id) AS libraries
WITH size(libraries) AS size, libraries
WITH size, apoc.coll.flatten(collect(libraries)) AS nodes_list
CALL apoc.periodic.iterate('
    MATCH (n)
    WHERE n.id IN $nodes_list
    RETURN n
    ', '
    SET n.community_id = $community_id
    ', {batchSize:1000, params:{nodes_list:nodes_list, community_id:size}}) YIELD batch, operations
RETURN 1
  • Next, to delete, for example, connected components that have fewer than 6 nodes:
CALL apoc.periodic.iterate('MATCH (n) WHERE n.community_id < $community_id RETURN n', 'DETACH DELETE n', {batchSize:1000, params:{community_id:6}})

Regards,
Cobra

1113
Node Clone

Hi,

I tried to delete all the nodes to do another import with the following command :

match (a) -[r] -> () delete a, r

And after a while, I got this error message :

Neo.DatabaseError.Transaction.TransactionCommitFailed

It makes me think of a DB settings issue (maybe the root cause of the query never ending?)
My settings :
dbms.directories.import=import
dbms.security.auth_enabled=true
dbms.memory.heap.initial_size=512m
dbms.memory.heap.max_size=4G
dbms.memory.pagecache.size=2G
dbms.tx_state.memory_allocation=ON_HEAP
dbms.connector.bolt.enabled=true
dbms.connector.http.enabled=true
dbms.connector.https.enabled=false
dbms.security.procedures.unrestricted=apoc.*
dbms.jvm.additional=-XX:+UseG1GC
dbms.jvm.additional=-XX:-OmitStackTraceInFastThrow
dbms.jvm.additional=-XX:+AlwaysPreTouch
dbms.jvm.additional=-XX:+UnlockExperimentalVMOptions
dbms.jvm.additional=-XX:+TrustFinalNonStaticFields
dbms.jvm.additional=-XX:+DisableExplicitGC
dbms.jvm.additional=-XX:MaxInlineLevel=15
dbms.jvm.additional=-Djdk.nio.maxCachedBufferSize=262144
dbms.jvm.additional=-Dio.netty.tryReflectionSetAccessible=true
dbms.jvm.additional=-Djdk.tls.ephemeralDHKeySize=2048
dbms.jvm.additional=-Djdk.tls.rejectClientInitiatedRenegotiation=true
dbms.jvm.additional=-XX:FlightRecorderOptions=stackdepth=256
dbms.jvm.additional=-XX:+UnlockDiagnosticVMOptions
dbms.jvm.additional=-XX:+DebugNonSafepoints
dbms.windows_service_name=neo4j

Seems ok to you ?

Have a great day !

1113
Node Clone

Tried another time and got :

Neo.DatabaseError.Statement.ExecutionFailed
Java heap space

Cobra
Ninja
Ninja

To delete everything in the database, you should use:

CALL apoc.periodic.iterate('MATCH (n) RETURN n', 'DETACH DELETE n', {batchSize:1000})

1113
Node Clone

Thank you !
BTW, I increased dbms.memory.heap.max_size to 16G and the delete query then executed successfully.

Cobra
Ninja
Ninja

That's another way, but it's always better to use the query I gave you, because it deletes in batches of 1000 nodes instead of in one large transaction.

1113
Node Clone

Hi Cobra,

The query is successful, but it seems no nodes are deleted. The query also terminates very quickly:

Query used :


CALL gds.wcc.stream({
    nodeProjection: "Entity",
    relationshipProjection: "DEPENDS"
})
YIELD nodeId, componentId
WITH componentId, collect(gds.util.asNode(nodeId).id) AS libraries
WITH size(libraries) AS size, libraries
WHERE size < 16
WITH apoc.coll.flatten(collect(libraries)) AS nodes_list
CALL apoc.periodic.iterate('
    MATCH (n)
    WHERE n.id IN $nodes_list
    RETURN n
    ', '
    DETACH DELETE n
    ', {batchSize:1000, params:{nodes_list:nodes_list}}) YIELD batch, operations
RETURN 1

Cobra
Ninja
Ninja

Can you show me what is returned by:

CALL gds.wcc.stream({
    nodeProjection: "Entity",
    relationshipProjection: "DEPENDS"
})
YIELD nodeId, componentId
WITH componentId, collect(gds.util.asNode(nodeId).id) AS libraries
WITH size(libraries) AS size, libraries
RETURN *

1113
Node Clone

Sure. Here's the result :

Cobra
Ninja
Ninja

Can you show me your properties on the right please? And tell me the one which is unique please

1113
Node Clone

Sure. Here you go :

Here's the headers of my csv file :
Entity:ID,description:LABEL

Entity and ID are the same data. I added a unique constraint on "Entity", even though I guess it is done automatically because Entity is used as the ID.

Best regards

Cobra
Ninja
Ninja

What is returned by:

CALL gds.wcc.stream({
    nodeProjection: "Entity",
    relationshipProjection: "DEPENDS"
})
YIELD nodeId, componentId
WITH componentId, collect(gds.util.asNode(nodeId).Entity) AS libraries
WITH size(libraries) AS size, libraries
RETURN *

1113
Node Clone

Same result :

Cobra
Ninja
Ninja

Try this:

CALL gds.wcc.stream({
    nodeProjection: "Entity",
    relationshipProjection: "DEPENDS"
})
YIELD nodeId, componentId
WITH componentId, collect(gds.util.asNode(nodeId)) AS libraries
RETURN *

1113
Node Clone

Same result, no data returned 🙂

Cobra
Ninja
Ninja

There is something weird...
And?

CALL gds.wcc.stream({
    nodeProjection: "Entity",
    relationshipProjection: "DEPENDS"
})
YIELD nodeId, componentId
RETURN *

1113
Node Clone

Same.
I tried also :

CALL gds.wcc.stream({
    nodeProjection: "Entity",
    relationshipProjection: "DEPENDS"
})
YIELD nodeId
RETURN * 

and

CALL gds.wcc.stream({
    nodeProjection: "Entity",
    relationshipProjection: "DEPENDS"
})
YIELD componentId
RETURN *

but same : no output

Cobra
Ninja
Ninja

I have no idea...
You made it work before and now it's not working anymore...
Is GDS still installed?

Anyway, the queries in my earlier message should solve the main issue, but you may have to change the property depending on which one is unique.

Regards,
Maxime

1113
Node Clone

Hi Maxime,

Yes, GDS is still installed. Another approach would be to associate with each node a network component ID and the number of nodes in that component. Do you think that would be possible?

Best regards

Cobra
Ninja
Ninja

I'm sorry, but I already gave you both approaches in a previous message. Both queries work on my local database, where I use the same labels and properties as yours.

I don't know what else to try. Maybe create a completely new database and retry. It was working on your database and now it doesn't...
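
One check that may help when a GDS call returns no rows is to compare the label and relationship type names used in the projection with what is actually stored in the database. A minimal sketch in plain Cypher, with no assumptions about the data model:

```cypher
// Count nodes per label combination; an unexpected label points at an import problem.
MATCH (n)
RETURN labels(n) AS labels, count(*) AS nodes
ORDER BY nodes DESC;

// Count relationships per type.
MATCH ()-[r]->()
RETURN type(r) AS type, count(*) AS relationships
ORDER BY relationships DESC;
```

If the names returned here differ from the nodeProjection and relationshipProjection values, the projection is not selecting the intended data.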

The first query works, but not for a large dataset. I will continue to search for why the second approach doesn't work. Anyway, I would like to thank you very much for your precious help.
I will keep you posted if I find a solution 😉

1113
Node Clone

Hi Maxime,

Good news, I found the source of the issue: my CSV headers were not correct.
I fixed that, but the query is still very, very slow. I started the query last night and this morning it was still running. I will create a new thread for this point with details 😉

Have a great day !

Cobra
Ninja
Ninja

Oh nice @1113

Even with the GDS query it's slow?
Did you use UNIQUE CONSTRAINTS and change the query to use this unique constraint?

I applied a unique constraint :

CREATE CONSTRAINT ON(l:Entity) ASSERT l.EntityID IS UNIQUE

But I'm not 100% sure that it is correctly reflected in the query:

CALL gds.wcc.stream({
    nodeProjection: "Entity",
    relationshipProjection: "IRW"
})
YIELD nodeId, componentId
WITH componentId, collect(gds.util.asNode(nodeId).EntityID) AS libraries
WITH size(libraries) AS size, libraries
WITH size, apoc.coll.flatten(collect(libraries)) AS nodes_list
CALL apoc.periodic.iterate('
    MATCH (n)
    WHERE n.EntityID IN $nodes_list
    RETURN n
    ', '
    SET n.community_id = $community_id
    ', {batchSize:1000, params:{nodes_list:nodes_list, community_id:size}}) YIELD batch, operations
RETURN 1

I use just one label: Entity.
EntityID is unique; it is the ID column for that label.

Here's my csv for Nodes :
EntityID:ID,description,:LABEL
232ec2ce7ea347258eb640c345322173,Item1,Entity

And the csv for Edges :
Source:START_ID,Target:END_ID,:TYPE
e53628fb3f414cbc9eb2546cedc70645,34c073e8781244bdb934c3539cdf1674,IRW

Cobra
Ninja
Ninja