Reliably create relationships on 12 million+ nodes

mikeM · July 23, 2020, 4:07pm

Hi Neo4J'ers,

I'm new to this platform and I'm struggling with the varying methods to perform commands. I've read all sorts of previous questions and neo4j docs but nothing seems to work.

The data I'm using is quite large: there's a people CSV of ~8 million rows, each with 14 properties, and a company CSV with ~4.5 million records with 25 properties each.

I've finally managed to create the 12.5 million nodes, after a lot of trial and error and a lot more waiting, by using USING PERIODIC COMMIT with LOAD CSV and upping the memory configs. However, nothing I can find allows me to create relationships between those nodes. Once they're in the database, no amount of apoc.periodic.iterate calls seems able to handle the volume of data that has to be checked to create the relationships.

The relationships I'm trying to create are:

Person is member of company (CompanyID is same on both)
Person is same as other Person (Name + DOB is same). People can be declared more than one time, with different meta data depending on their ties to company.

How do bulk relationships get assigned reliably?

A relationship I'm after works if I've only got a few records, so the logic seems okay.

MATCH (a:PSC),(b:PSC)
WHERE a.name = b.name AND a.dateOfBirth = b.dateOfBirth
CREATE (a)-[r:SAME_PERSON]->(b)
RETURN type(r), r.companyNumber

I can't seem to run this over a periodic commit so it fails quite quickly as the memory runs out.

I've tried apoc iterate too...

CALL apoc.periodic.iterate("MATCH (a:PERSON),(b:Company) WHERE a.companyNumber = b.companyNumber CREATE (a)-[r:IS_PSC_OF { companyNumber: a.companyNumber + '<->' + b.companyNumber }]->(b)",
"RETURN type(r), r.companyNumber",
{batchSize:10000, parallel: true, iterateList:true})

This doesn't work, no matter how much I try to tweak the memory settings. At best it runs for about 16 hours before falling over.

Do I assign the relationships as they're created?

Is this not possible on a MacBook with 16 GB of RAM, or should I be doing this remotely on a box with a bit more punch?

Sorry for the winding post. I've been at this for a while making terribly slow progress so any help would be appreciated.

Cheers

stefan.armbruster · July 24, 2020, 6:05pm

In apoc.periodic.iterate the second statement is executed in batches, once for each result of the first, so you need to do the CREATE in the second one:

CALL apoc.periodic.iterate("MATCH (a:PERSON),(b:Company) WHERE a.companyNumber = b.companyNumber",
"CREATE (a)-[r:IS_PSC_OF { companyNumber: a.companyNumber + '<->' + b.companyNumber }]->(b)",
{batchSize:10000, parallel: true, iterateList:true})

You might even speed up the first query by avoiding a cartesian product. Assuming you have fewer Company than PERSON nodes (so the label scan runs over the smaller set), you could do:

MATCH (b:Company)
MATCH (a:PERSON {companyNumber:b.companyNumber})
RETURN a,b

Be sure to have an index on PERSON(companyNumber) to speed up that lookup.
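
A minimal sketch of that index, assuming the Neo4j 4.x syntax (on 3.5 the older `CREATE INDEX ON :PERSON(companyNumber)` form applies instead). The index name `person_company_number` is made up for illustration, and each statement is meant to be run on its own:

```cypher
// Index so the second MATCH can look PERSON nodes up by companyNumber
// instead of scanning every PERSON for each Company.
// The index name is hypothetical; any unused name works.
CREATE INDEX person_company_number FOR (a:PERSON) ON (a.companyNumber);

// EXPLAIN shows the plan without running the query. An index seek on
// PERSON (for example a NodeIndexSeek operator) suggests the index is used.
EXPLAIN
MATCH (b:Company)
MATCH (a:PERSON {companyNumber: b.companyNumber})
RETURN a, b;
```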

mikeM · July 26, 2020, 10:40am

You're a steely eyed missile man Stefan, thanks!

I had to tweak the snippet you gave to give a return in the first statement or it wouldn't run, but I found that in the docs after playing round with your suggestion. I've created millions of relationships within 2 minutes. After days of trying to get it to work, I really appreciate the help.
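
The exact tweak is not shown in the thread. One form consistent with the description, assuming the cartesian-product-free lookup suggested earlier and the same options as in the original call, might look like this (the `YIELD` columns are added here only to surface failed batches):

```cypher
// First statement: only finds the pairs to link and returns them.
// Second statement: runs for each returned row, committed in batches of batchSize.
CALL apoc.periodic.iterate(
  "MATCH (b:Company) MATCH (a:PERSON {companyNumber: b.companyNumber}) RETURN a, b",
  "CREATE (a)-[r:IS_PSC_OF { companyNumber: a.companyNumber + '<->' + b.companyNumber }]->(b)",
  {batchSize: 10000, parallel: true, iterateList: true})
YIELD batches, total, failedBatches, errorMessages
RETURN batches, total, failedBatches, errorMessages
```

Only the variables returned by the first statement are visible to the second, which is why the `RETURN a, b` is needed.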

stefan.armbruster · July 26, 2020, 3:10pm

Too obvious. Indeed, I forgot about the return in the first statement. That's what happens when you reply to forums from your hammock. Glad you've sorted things out.

mikeM · August 3, 2020, 1:46pm

Whilst that first instance worked, it is still temperamental. About 90% of the time, if I have to re-seed the collection, it just fails.

Admittedly I'm new to Neo4j but it's really frustrating to work with. The same statements, run independently of each other, fail most of the time, so it's nigh on impossible to get consistent results.

The more complex relationship I need to create never works. Not even after two days of watching it spin around. :) It's a bigger set of types and creating relationships between them requires more property checks, but I can't seem to get close to getting my data seeded correctly because of all these failures.

This is the call:

CALL apoc.periodic.iterate(
"MATCH(a:PSC) MATCH(b:PSC) WHERE a.name = b.name AND a.dateOfBirth = b.dateOfBirth AND NOT(a.companyNumber = b.companyNumber) RETURN a",
"CREATE (a)-[r:SAME_PSC { name: a.name + '<->' + b.name }]->(b)",
{batchSize:10000, parallel: true, iterateList:true})

Based on the above advice, this seems a practicable and reasonable call. Please do correct me if that's wrong! If I run this over, say, 10 documents, it works and produces the relationships I'm looking for. But over 8 million PSC nodes, it blows up endlessly.
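
One thing that may be worth checking in the call above: the first statement returns only `a`, and only returned variables are passed to the second statement, so `b` there would not refer to the matched node. A hedged sketch of an adjusted call follows. It returns both nodes, produces each pair once rather than in both directions via `id(a) < id(b)` (an assumption that one relationship per pair is enough), and sets `parallel: false` on the assumption that concurrent batches touching the same nodes can contend for locks:

```cypher
// First statement: find each matching pair once and return BOTH nodes.
// id(a) < id(b) drops the mirrored (b, a) row and the self-pair.
CALL apoc.periodic.iterate(
  "MATCH (a:PSC) MATCH (b:PSC {name: a.name, dateOfBirth: a.dateOfBirth}) WHERE id(a) < id(b) AND NOT (a.companyNumber = b.companyNumber) RETURN a, b",
  "CREATE (a)-[r:SAME_PSC { name: a.name + '<->' + b.name }]->(b)",
  {batchSize: 10000, parallel: false, iterateList: true})
YIELD batches, total, failedBatches, errorMessages
RETURN batches, total, failedBatches, errorMessages
```

This is a sketch rather than a tested fix; `id()` is the Neo4j 4.x function, and the lookup on `b` only avoids a full scan per row if a suitable index on `:PSC` exists.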

I've played with the permitted memory usages in the DB settings.
Is there any basic configuration for the Neo4j Desktop browser that I should have done beforehand?
Is there some way to make sure that it doesn't hit the heap size but buffers effectively instead of dying each time?

Are all these errors the norm for Neo4j? Perhaps my installation is messed up.

Cheers

andrew_bowman · August 4, 2020, 6:06pm

You'll want to run an EXPLAIN of the outer query, watch out for Eager operators, and make sure indexes are being used for lookup (though one of those two matches will have to use a label scan; there's no way around that).

You might also consider a composite index on :PSC(name, dateOfBirth)
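
A minimal sketch of both suggestions, assuming Neo4j 4.x index syntax and using the outer query as posted. The index name `psc_name_dob` is made up for illustration, and each statement is meant to be run on its own:

```cypher
// Composite index covering both properties used in the equality match.
// The index name is hypothetical.
CREATE INDEX psc_name_dob FOR (p:PSC) ON (p.name, p.dateOfBirth);

// EXPLAIN plans the outer query without executing it. Things to look for:
// an index seek on one of the two PSC matches, and no Eager operator.
EXPLAIN
MATCH (a:PSC) MATCH (b:PSC)
WHERE a.name = b.name AND a.dateOfBirth = b.dateOfBirth
  AND NOT (a.companyNumber = b.companyNumber)
RETURN a;
```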

mikeM · August 7, 2020, 1:58pm

Thanks for the pointers, Andrew. The reminder about the index helped somewhat: the process seemed to be achieving something until I realised that applications were shutting down because the DB had grown to over 125 GB.

However, after deleting that database and starting again to free up disk space so my machine could run, I now find that whenever I start a new database and close the connection, the client will not let me start it again and connect to the server. So I've given up and deleted the application.

Maybe it is just the desktop client, but it really doesn't seem fit for purpose, and I've had enough of trying to coax it into doing what is supposed to be its most basic function.

I might try the Node library rather than the client but my patience is completely shot with it at the moment. Such a shame because it seemed really promising for our organisation.

Cheers
