Loading Neo4J Relationships takes literally ages

Dear community, dear Michael,
I have managed to load the nodes with LOAD CSV.
The issue I am struggling with right now is creating the relationships accordingly. Is this something that is done quicker with Java, due to the internal structure of Neo4j?
Thank you very much.
Kindest Regards,
Malte

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///relationship_officeraa' AS row
FIELDTERMINATOR ';'
MATCH (m),(n)
WHERE m.node_id = trim(row.start_id) and n.node_id = trim(row.end_id) and n.source = trim(row.source) and m.source = trim(row.source)
MERGE (m)-[r:OFFICER_OF {source: row.source, status: row.status, start_date: row.start_date, end_date: row.end_date, type: row.links }]-(n)
return  r;

This is what the relationship files look like:

start_id;end_id;source;status;start_date;end_date;links
605;729;RISE NETWORK;N;N;N;ALEXANDRU SODRINGA partner AQEEL MOHAMED DHUYAB

Loading relationships like this takes ages:
roughly 5 days for 3000 created relationships.

The indexes are online, and each node type has a composite index on node_id and source.

description:    "INDEX ON :ADDRESS(node_id, source)"
indexName:      "index_19"
tokenNames:     ["ADDRESS"]
properties:     ["node_id", "source"]
state:          "ONLINE"
type:           "node_label_property"
progress:       100.0
provider:       { "version": "1.0", "key": "native-btree" }
id:             19
failureMessage: ""
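For reference, on Neo4j 3.x a composite index like the one listed above can be created and inspected like this (the :ADDRESS label is taken from the listing; adjust for your other node labels):

```cypher
// Create a composite index on the ADDRESS label
CREATE INDEX ON :ADDRESS(node_id, source);

// List all indexes and check their state is "ONLINE"
CALL db.indexes();
```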

match(n) return count(n);

"count(n)"
58570409

Thank you very much.
Kind Regards,
Malte

Hi @MalteR

Maybe your import takes ages because your query tries to take too big a bite at once. Currently, your commit batch is set to 1000.
The second thing I noticed is that your MATCH/MERGE clauses don't have labels specified. When you use MATCH (m), it will scan through the whole graph, which is quite costly.


Hi Paul,
thank you very much for your reply.
Can you give me any advice on how to make this less costly? Regarding labels, I guess you mean the type of node, right? I might be able to specify that. So my scenario would most likely run faster if I specified the node types and then executed against the database? What is your background with big data, size-wise?
Kindest Regards,
Malte

First, maybe you can describe your model and data in a bit more depth.
From the relationship name "OFFICER_OF", I would suggest adding an Officer label to the (m) nodes.

After reviewing your graph model and adding labels, try running queries like

PROFILE MATCH (m) RETURN m 
versus
PROFILE MATCH (m: Officer) RETURN m 

You should see the cost difference between these queries.

It should speed things up.


@MalteR
I am using graphs to store identity access history. The current graph is not too big at the moment and I am using Community Edition, but I believe that by the end of 2020 it will grow to a few million nodes as I add more data sources.

I had a similar issue while syncing data from different data sources as my graph started growing. After I added additional labels that corresponded to each data source, I managed to speed up reads and writes to the graph drastically. Some data loading queries that used to take hours now finish in a few minutes. The thing to keep in mind is how big a dataset your query is addressing. Thinking in terms of subgraphs helped me improve my model; it may help you as well.
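To illustrate the per-data-source labels described above: assuming each node carries a source property, you could tag existing nodes once and then match on the label instead of the property. The label name RISE_NETWORK here is hypothetical, derived from the sample data; apoc.text.replace from APOC is used for the dynamic variant since labels cannot be parameterized in plain Cypher.

```cypher
// Static variant: tag nodes of one known source with a label
MATCH (n {source: 'RISE NETWORK'})
SET n:RISE_NETWORK;

// Dynamic variant (APOC): derive a label from the source property,
// replacing spaces so it is a valid label name
MATCH (n)
WHERE exists(n.source)
CALL apoc.create.addLabels(n, [apoc.text.replace(n.source, ' ', '_')]) YIELD node
RETURN count(node);
```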

I suggest reading "Cypher Tips & Tricks" post made by @michael.hunger
There is a lot of good information there :slight_smile:

Hello Paul,
thank you very much for your input. The reason I am asking about big data is that what some people consider big data is actually a pretty small dataset, e.g. one million entries. For me, big data starts at ten million entries at the least.
Can I constrain on a label (node_type) dynamically? Is there any way to do that from a file column within the MATCH part of the Cypher query? Attached are my labels, which would have to be constrained by a column of the file, if possible.

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///relationship_officeraa' AS row
FIELDTERMINATOR ';'
MATCH (m),(n)
WHERE m.node_id = trim(row.start_id) and n.node_id = trim(row.end_id) and n.source = trim(row.source) and m.source = trim(row.source)
MERGE (m)-[r:OFFICER_OF {source: row.source, status: row.status, start_date: row.start_date, end_date: row.end_date, type: row.links }]-(n)
return  r;

Thank you very much in advance.
Kindest Regards,
Malte

Test this (you need APOC; apoc.cypher.run is used here because labels cannot be parameterized in plain Cypher):

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///import.csv' AS row
FIELDTERMINATOR ';'
	CALL apoc.cypher.run('
	MATCH 
	(m:`'+row.source+'` {node_id: $MNodeID}), 
	(n:`'+row.source+'` {node_id: $NNodeID})
	RETURN m, n'
	, {MNodeID: trim(row.start_id),NNodeID: trim(row.end_id)})	
	YIELD value
WITH value.m as m, value.n as n, row
MERGE (m)-[r:OFFICER_OF {source: row.source, status: row.status, start_date: row.start_date, end_date: row.end_date, type: row.links }]-(n)
return  r;

@MalteR In addition to using row.source as a label, maybe your data allows you to add some labels to the (m) and (n) nodes. If they are different types, that would improve performance even more.

Hi @MalteR,

You've created a good composite index. The problem is that it's not being used in your match. The match needs the label portion to utilize the index.

Try replacing MATCH (m),(n) with MATCH (m:Address), (n:Address)

Also, as a matter of clarity, you can structure it like below (dropping the WHERE):

MATCH (m:Address {node_id:trim(row.start_id), source: trim(row.source)})
MATCH (n:Address {node_id:trim(row.end_id), source: trim(row.source)}) 
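Putting that together with the original LOAD CSV, the whole import might look like the sketch below (assuming both endpoints carry the :Address label; the trailing RETURN r is dropped, since returning rows is not needed during a bulk load):

```cypher
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///relationship_officeraa' AS row
FIELDTERMINATOR ';'
MATCH (m:Address {node_id: trim(row.start_id), source: trim(row.source)})
MATCH (n:Address {node_id: trim(row.end_id), source: trim(row.source)})
MERGE (m)-[r:OFFICER_OF {source: row.source, status: row.status,
  start_date: row.start_date, end_date: row.end_date, type: row.links}]-(n);
```

With the labels in place, both MATCH clauses can use the composite (node_id, source) index instead of scanning all 58 million nodes per row.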

Hope that helps,
-Mike