The repetitive behavior is due to a cartesian product between the results of the two MATCH statements that follow the UNWIND. What is happening is that the DISTINCT passes on two rows with the same 'n', one for each outgoing relationship of that 'n'. The MATCH after the DISTINCT then executes once per outgoing relationship for the same 'n'. Because the same value of 'n' is passed multiple times to the second MATCH, you get multiple identical relationships.
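To see the row multiplication in isolation, here is a minimal sketch that needs no data or APOC. The literal lists are hypothetical stand-ins for what the two MATCH clauses would return for a single node with two outgoing relationships and one incoming relationship:

```cypher
// Toy stand-in: one node 'n1' with two outgoing and one incoming relationship.
UNWIND ['n1'] AS n
UNWIND ['out1', 'out2'] AS outRel   // plays the first MATCH: two rows for the same n
UNWIND ['in1'] AS inRel             // plays the second MATCH: runs once per row above
RETURN n, outRel, inRel
// Two rows come back, and 'in1' appears in both of them.
```

In the real query each of those rows would call apoc.create.vRelationship, which is why the incoming virtual relationship is created twice.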
There are several approaches. I see you figured out one, which is to collect the results for each value of 'n' passed from the first phase of your query. Doing this results in only one row per node 'n' being passed on, since the multiple relationships for the given 'n' have been collected into a single list.
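As a rough illustration of why collecting helps, the sketch below uses literal lists as hypothetical stand-ins for the MATCH results (one node, two outgoing relationships, one incoming). It is not your actual query, only the shape of the fix:

```cypher
// Toy stand-in: one node 'n1' with two outgoing and one incoming relationship.
UNWIND ['n1'] AS n
UNWIND ['out1', 'out2'] AS outRel      // two rows for the same n
WITH n, collect(outRel) AS outRels     // aggregate back down to one row per n
UNWIND ['in1'] AS inRel                // now runs only once for n
RETURN n, outRels, inRel
// One row comes back: 'in1' appears once, next to the list ['out1', 'out2'].
```

The aggregation groups by every non-aggregated column in the WITH, so anything else you need later (such as virtualNode) has to be carried through as a grouping key.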
Another approach, which seems cleaner, is to use a UNION, where one part of the query creates the outgoing relationships and the other creates the incoming relationships. This works because each part can be written to return the same columns.
[Image: result from the first query, with the original nodes and relationships removed to focus on the virtual node and relationships]
Refactored query using UNION approach:
// Merge the properties of all :Sub nodes into a single virtual node
MATCH (n:Sub)
WITH collect(n) AS nodes
WITH apoc.map.mergeList([node IN nodes | apoc.any.properties(node)]) AS mergedProps, nodes
CALL apoc.create.vNode(['vSub', 'VirtualNode'], mergedProps) YIELD node AS virtualNode
CALL {
    // Outgoing: virtualNode -> relatedNode
    WITH virtualNode, nodes
    UNWIND nodes AS n
    MATCH (n)-[r]->(relatedNode)
    WITH virtualNode, n, r, relatedNode
    CALL apoc.create.vRelationship(virtualNode, type(r), apoc.any.properties(r), relatedNode) YIELD rel AS vRel
    RETURN n, vRel, relatedNode
    UNION
    // Incoming: relatedNode -> virtualNode
    WITH virtualNode, nodes
    UNWIND nodes AS n
    MATCH (n)<-[r]-(relatedNode)
    WITH virtualNode, n, r, relatedNode
    CALL apoc.create.vRelationship(relatedNode, type(r), apoc.any.properties(r), virtualNode) YIELD rel AS vRel
    RETURN n, vRel, relatedNode
}
// Both branches return the same columns (n, vRel, relatedNode), as UNION requires
RETURN virtualNode, relatedNode, vRel