Reversing every relationship in a large graph

I made a mistake and ingested 160 million relationships the wrong way (on 32 million nodes). The nodes are PubMed article_ids and the relationships are citations. I have (a:Article)-[:CITES]->(b:Article) where it should be (a:Article)<-[:CITES]-(b:Article).

I have tried the following:

MATCH (a:Article)-[rel:CITES]->(b:Article)
CALL apoc.refactor.invert(rel)
YIELD input, output
RETURN count(*);

but keep getting (after about half an hour or more) "Server at localhost(127.0.0.1):7687 is no longer available".

I'm not sure how to deal with this error: is my large query crashing the server? I previously increased dbms.memory.heap.max_size to resolve an out-of-memory error.

My dedicated machine has 16 GB of RAM, and the nodes carry only article_ids (from 1 to 32 million).

If the APOC procedure won't run, is there another way of doing this? For instance, I could create all the reverse relationships manually and then delete all the old CITES relationships.

This works on my small test graph; is it a good idea to run it on such a large graph?

MATCH (m:Article)-[c:CITES]->(n:Article)
DELETE c
CREATE (m)<-[:CITES]-(n);

Are we sure that the operation will be all-or-none (i.e. atomic)? The last thing I would want is for some unknown number of relationships to be reversed.

To be safe about not re-matching relationships I've already reversed, I guess I could do it in two passes via a temporary relationship type:

MATCH (m:Article)-[c:CITES]->(n:Article)
DELETE c
CREATE (m)<-[:REFERENCES]-(n);

MATCH (m:Article)-[r:REFERENCES]->(n:Article)
DELETE r
CREATE (m)-[:CITES]->(n);

Edit: I am running the above on my large dataset, and the first statement has been running for over two hours, with the following CPU usage:

[Screenshot of CPU usage, 2021-01-04 7:41 PM]

Edit 2: Three and a half hours in, it crashed with "Server at localhost(127.0.0.1):7687 is no longer available".

My next solution was to divide the problem into batches. In Python:

import neo4j
from tqdm import tqdm

batch_size = 5000
max_id = 33307598

driver = neo4j.GraphDatabase.driver(neo4j_uri, auth=neo4j_auth)
session = driver.session()
for batch in tqdm(range(max_id // batch_size)):
    # Reverse one slice of articles at a time, keyed on ArticleId
    query = ("MATCH (m:Article)-[c:CITES]->(n:Article) "
             "WHERE m.ArticleId >= " + str(batch * batch_size) +
             " AND m.ArticleId < " + str((batch + 1) * batch_size) + " "
             "DELETE c "
             "CREATE (m)<-[:REFERENCES]-(n);")
    print(query)
    result = session.run(query)
session.close()
driver.close()

Unfortunately, the first iteration of this took nearly a minute, meaning the process extrapolates to over 100 hours. It would be faster just to re-ingest.
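That extrapolation is easy to sanity-check: at the observed rate of roughly one minute per batch, the loop above runs a few thousand batches and lands at around a hundred hours of wall-clock time.

```python
batch_size = 5000
max_id = 33307598

# Number of iterations the batching loop above runs
n_batches = max_id // batch_size
print(n_batches)  # 6661

# At ~1 minute per batch, total wall-clock time in hours
hours = n_batches / 60
print(round(hours))  # 111
```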

Sounds like a job for apoc.periodic.iterate: let the library take care of batching (and optional parallel execution) for you.

There are several examples of how to use it on that page.
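For this case, a sketch might look like the following, combining apoc.periodic.iterate with the apoc.refactor.invert call from the original attempt. The batchSize of 10,000 is a guess to tune against your heap, and parallel is left false because the batches touch shared nodes and could otherwise contend for locks:

```
CALL apoc.periodic.iterate(
  "MATCH (a:Article)-[r:CITES]->(b:Article) RETURN r",
  "CALL apoc.refactor.invert(r) YIELD input, output RETURN count(*)",
  {batchSize: 10000, parallel: false}
);
```

Each batch is committed in its own transaction, which is what keeps the heap bounded; the trade-off is that the operation as a whole is no longer atomic, so check the returned batch/failure counts when it finishes.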
