Iterate through list and process nodes, but exclude already processed ones

Hi community,

I have a list of node IDs (e.g., [1, 2, 3, 4, 5]) that correspond to nodes in a Neo4j graph. I want to process these nodes one by one, in the order of the list. During this processing, a node may be linked to another node, possibly one that appears later in the list.

My goal is:

To process each node only once, based on whether it is already connected to something (i.e., has any relationships).

If a node is already connected as a result of being processed earlier, I want to skip it when its turn comes up later in the list.

I tried the following Cypher pattern:

WITH [1, 2, 3, 4, 5] AS ids
UNWIND ids AS id
MATCH (n)
WHERE n.Id = id AND NOT (n)--()
// Do some processing and potentially create new edges

I expected that once a node is linked (e.g., node with Id 4 gets connected during the processing of node 1), it would not be matched again when the UNWIND reaches it later. However, this is not the case: all nodes get processed regardless of changes made earlier in the same query.

Thanks for your help...

When creating the new edges, try using MERGE instead of CREATE.

You are assuming the query runs in a particular order. It is possible that all the matches are completed before the query moves on to the step that processes their results.
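To make the difference concrete, here is a minimal sketch in plain Python rather than Cypher. It does not reproduce how Neo4j plans a query; it only models the two orderings being discussed. The in-memory graph, the `PARTNERS` rule (processing node 1 links it to node 4) and all function names are assumptions made up for this illustration.

```python
# Hypothetical in-memory stand-in for the graph: node id -> set of linked node ids.
def make_graph():
    return {1: set(), 2: set(), 3: set(), 4: set(), 5: set()}


def link(graph, a, b):
    """Create an undirected edge between a and b."""
    graph[a].add(b)
    graph[b].add(a)


# Assumed rule for the demo: processing node 1 links it to node 4.
PARTNERS = {1: 4}


def process(graph, node_id):
    """Placeholder for the real processing step."""
    partner = PARTNERS.get(node_id)
    if partner is not None:
        link(graph, node_id, partner)


def run_match_first(ids):
    """All 'matches' are evaluated up front, then the results are processed."""
    graph = make_graph()
    matched = [i for i in ids if not graph[i]]  # snapshot taken before any write
    for i in matched:
        process(graph, i)
    return matched


def run_interleaved(ids):
    """The 'has no relationships' check is evaluated right before each item."""
    graph = make_graph()
    processed = []
    for i in ids:
        if graph[i]:  # already connected by an earlier step, so skip it
            continue
        process(graph, i)
        processed.append(i)
    return processed


print(run_match_first([1, 2, 3, 4, 5]))  # [1, 2, 3, 4, 5]
print(run_interleaved([1, 2, 3, 4, 5]))  # [1, 2, 3, 5]
```

The first function matches the behaviour described in the question (node 4 is still processed), while the second is the behaviour that was expected.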

Hi @ameyasoft,

If it were that simple, I would have done it for sure.
Unfortunately, a query of about 200-300 lines runs after that, which tears down structures and rebuilds them according to the data. This is time-consuming, and doing it multiple times for the same connected nodes would be inefficient.

Thanks for the hint...

Hi @glilienfield,

Yeah, that is exactly what seems to happen. The matches are processed first, right after the UNWIND, and the results are then passed to the part of the query where the connecting takes place.
The changes to the nodes are not taken into account while iterating through the result set.

Is there an APOC function or a trick to force the system to process the items one after another?

Thanks for your quick response (as always :grinning_face:)

Is your list huge? Do you need efficiency? Is it a process that will run regularly?

Yes, the list is huge and involves a large number of operations on the nodes/relationships.
The query is called very often and touches clusters ranging in size from 1 to 1000 nodes.

OK ... but not millions?

I did something similar for the creation of relationships. In my case I get 2 types of leaves (let's call them 'values' and 'references') that are connected to a node type A, that node type A is connected to node type B, and so on.

I have no idea whether any given node (other than references) exists, so I create my values if they don't exist (using a provided identifier), and if they pre-exist, I fetch them and keep a pair ('this is the id the application proposed', 'this is the one in the DB').

Then I check whether a node type A already exists that is connected to the list of values/references. If it doesn't, I do a similar iteration as above.

So I start with a list of maps [{ proposed: "A", actual: "A" }] and gradually replace the values of actual as I build my graph: [{ proposed: "A", actual: "12345" }].

In part of the tree I can have branches that connect to branches I am creating in that same UNWIND, so before I iterate I create a temporary node with 2 properties (one for the list of proposed, one for the actuals). Inside the CALL { } I fetch that node and update it as I create nodes, then MERGE it back before the end of the call.

My number of nodes is small-ish (no more than a hundred on average), but they can be very interconnected.
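The proposed/actual bookkeeping described above can be sketched roughly as follows, in plain Python instead of Cypher. This is only an illustration of the idea of carrying state through the iteration, not the poster's actual query: the `existing` lookup, the natural keys and the `new_id` callback are all assumptions introduced here.

```python
# Hypothetical stand-in for the database: natural key -> id already stored in the DB.
existing = {"alice": "12345"}


def resolve_ids(items, existing, new_id):
    """Map each proposed id to the id actually used, creating nodes on demand.

    items:  list of (proposed_id, natural_key) pairs, in processing order
    new_id: callable that takes a proposed id and returns the id to store
    """
    mapping = {}  # proposed id -> actual id; this is the state carried along
    for proposed, key in items:
        if key in existing:
            # Node pre-exists (possibly created earlier in this same pass).
            mapping[proposed] = existing[key]
        else:
            # Node is new: create it and remember it for later items.
            actual = new_id(proposed)
            existing[key] = actual
            mapping[proposed] = actual
    return mapping


items = [("A", "alice"), ("B", "bob"), ("C", "bob")]
mapping = resolve_ids(items, existing, new_id=lambda proposed: proposed)
print(mapping)  # {'A': '12345', 'B': 'B', 'C': 'B'}
```

Item C resolves to the node created for item B a moment earlier, which is the effect the temporary node is meant to achieve inside the UNWIND.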

I found a solution for now...

Problem Description

In my graph, some Person nodes are already assigned to clusters. These clusters are represented by dedicated Cluster nodes that connect to all members of that group.

Now, I want to take unclustered persons from a given list of IDs and group them into new clusters based on their graph connectivity, i.e., if they're connected through any path (via relationships like HAS_FRIEND, LIVES_IN, etc.).

The tricky part is: if you naively process all persons in the list, each node could end up creating its own cluster with all the others, even if they're already part of the same connected subgraph. For example, if there are 100 connected nodes in the list, Person 1 clusters with the other 99, Person 2 does the same, and so on, which obviously defeats the purpose.
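A small plain-Python sketch of that duplication, outside the database. The sample graph in `EDGES` and the helper `component` are made up for the illustration and only stand in for the real traversal.

```python
from collections import deque

# Hypothetical sample graph: person id -> ids of directly connected persons.
EDGES = {
    101: {104},
    104: {101, 106},
    106: {104},
    108: {110},
    110: {108},
    112: set(),
}


def component(start, edges):
    """Return every id reachable from start (breadth-first search)."""
    seen, queue = {start}, deque([start])
    while queue:
        for neighbour in edges[queue.popleft()]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


person_ids = [101, 108, 104, 106, 110, 112]

# Naive approach: every listed person builds a cluster from its own subgraph.
naive_clusters = [component(pid, EDGES) for pid in person_ids]

print(len(naive_clusters))                          # 6 clusters built
print(len({frozenset(c) for c in naive_clusters}))  # only 3 distinct groups
```

Six clusters are built for three actual groups, so half of the work is redundant even in this tiny example.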

Solution Strategy

To prevent that, I do the following:

  1. I start with a list of unclustered person IDs.
  2. For each ID, I retrieve the full connected subgraph of Person nodes (using apoc.path.subgraphNodes).
  3. I then sort the found nodes by their Id and select the one with the lowest ID as the designated "cluster starter".
  4. Only those lowest-ID nodes are used for further cluster creation logic. This ensures each cluster is only created once per connected group.
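The steps above can be sketched in plain Python as follows. This is a model of the strategy, not a translation of the Cypher query: the `EDGES` sample graph and both function names are assumptions. It also includes one variation that is not part of the steps as listed, namely skipping any ID whose group was already found via an earlier ID, which avoids traversing the same subgraph repeatedly.

```python
from collections import deque

# Hypothetical sample graph: person id -> ids of directly connected persons.
EDGES = {
    101: {104},
    104: {101, 107},
    107: {104},
    106: {108},
    108: {106},
    110: set(),
    112: {110},
}
EDGES[110].add(112)  # keep the 110-112 edge symmetric


def component(start, edges):
    """Return every id reachable from start (breadth-first search)."""
    seen, queue = {start}, deque([start])
    while queue:
        for neighbour in edges[queue.popleft()]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def cluster_starters(person_ids, edges):
    """Return the lowest id of each connected group touched by person_ids."""
    starters, covered = [], set()
    for pid in person_ids:
        if pid in covered:  # variation: group already found, skip the traversal
            continue
        group = component(pid, edges)
        covered |= group
        starters.append(min(group))  # lowest id acts as the cluster starter
    return sorted(starters)


print(cluster_starters([101, 108, 104, 106, 107, 110, 112], EDGES))
# [101, 106, 110]
```

With seven IDs in the list, only three traversals run and three starters come back, one per connected group.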

Query

// Example list of unclustered person IDs
WITH [101,108,104,106,107,110,112] AS personIds

UNWIND personIds AS id
MATCH (startNode:Person {Id: id})
CALL apoc.path.subgraphNodes(startNode, {
  minLevel: 0,
  uniqueness: "NODE_GLOBAL",
  relationshipFilter: "HAS_FRIEND|LIVES_IN|..." // Include all relevant relationship types
}) YIELD node
WHERE node:Person

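// Sort so that the node with the lowest Id comes first
WITH node, id
ORDER BY node.Id

// Per start id, keep only the lowest-Id node of its subgraph
WITH head(collect(node)) AS starterNode, id
// Start ids from the same group yield the same starter, so deduplicate
WITH collect(DISTINCT starterNode) AS clusterStarters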

RETURN clusterStarters

What this does

This returns a list of the lowest-ID Person nodes from each connected group. These can then be used as the entry points for actual cluster generation, avoiding duplication and inefficiency.


Let me know if you see any optimization potential, especially for large graphs, or if you know of a more elegant way to do the work.

Cheers!