Loading Neo4J Relationships takes literally ages

Dear community, dear Michael,
I have managed to load the nodes with LOAD CSV.
The issue I am struggling with right now is creating the relationships accordingly. Is this something that is done quicker with Java, due to the internal structure of Neo4j?
Thank you very much.
Kindest Regards,
Malte

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///relationship_officeraa' AS row
FIELDTERMINATOR ';'
MATCH (m),(n)
WHERE m.node_id = trim(row.start_id) and n.node_id = trim(row.end_id) and n.source = trim(row.source) and m.source = trim(row.source)
MERGE (m)-[r:OFFICER_OF {source: row.source, status: row.status, start_date: row.start_date, end_date: row.end_date, type: row.links }]-(n)
return  r;

This is what the relationship files look like:

start_id;end_id;source;status;start_date;end_date;links
605;729;RISE NETWORK;N;N;N;ALEXANDRU SODRINGA partner AQEEL MOHAMED DHUYAB

Loading relationships like this takes ages:
roughly 5 days for 3000 created relationships.

The indexes are online, and each node type has a composite index on node_id and source.

description:    "INDEX ON :ADDRESS(node_id, source)"
indexName:      "index_19"
tokenNames:     ["ADDRESS"]
properties:     ["node_id", "source"]
state:          "ONLINE"
type:           "node_label_property"
progress:       100.0
provider:       { "version": "1.0", "key": "native-btree" }
id:             19
failureMessage: ""
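For reference, on Neo4j 3.x a composite index like the one listed above can be created and inspected like this (the :ADDRESS label is taken from the listing; adjust for your other node labels):

```cypher
// Create a composite index on the ADDRESS label
CREATE INDEX ON :ADDRESS(node_id, source);

// List all indexes and check their state is "ONLINE"
CALL db.indexes();
```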

match(n) return count(n);

"count(n)"
58570409

Thank you very much.
Kind Regards,
Malte

Hi @MalteR

Maybe your import takes ages because your query tries to take too big a bite at once. Currently, your commit batch is set to 1000.
The second thing I noticed is that your MATCH/MERGE clauses don't have labels specified. When you use MATCH (m), it will scan through the whole graph, which is quite costly.


Hi Paul,
thank you very much for your reply.
Can you give me any advice on how to make this less costly? Regarding labels, I guess you mean the type of node, right? I might be able to specify that. So my scenario would most likely run faster if I specified the node types and then executed against the database? What is your background with big data, size-wise?
Kindest Regards,
Malte

First, maybe you can describe your model and data in a bit more depth.
From the relationship name "OFFICER_OF", I would suggest adding an Officer label to the (m) nodes.

After reviewing your graph model and adding labels, try running queries like

PROFILE MATCH (m) RETURN m 
versus
PROFILE MATCH (m: Officer) RETURN m 

You should see the cost difference between these queries.

It should speed things up.


@MalteR
I am using graphs to store identity access history. The current graph is not too big at the moment and I am using Community Edition, but I believe that by the end of 2020 it will grow to a few million nodes as I add more data sources.

I had a similar issue while syncing data from different data sources as my graph started growing. After I added additional labels that corresponded to each data source, I managed to speed up reads and writes to the graph drastically. Some data loading queries that used to take hours now finish in a few minutes. The thing to keep in mind is how big a dataset your query is addressing. Thinking in terms of subgraphs helped me improve my model; it may help you as well.
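To illustrate the per-data-source labels described above: assuming each node carries a source property, you could tag existing nodes once and then match on the label instead of the property. The label name RISE_NETWORK here is hypothetical, derived from the sample data; apoc.text.replace from APOC is used for the dynamic variant since labels cannot be parameterized in plain Cypher.

```cypher
// Static variant: tag nodes of one known source with a label
MATCH (n {source: 'RISE NETWORK'})
SET n:RISE_NETWORK;

// Dynamic variant (APOC): derive a label from the source property,
// replacing spaces so it is a valid label name
MATCH (n)
WHERE exists(n.source)
CALL apoc.create.addLabels(n, [apoc.text.replace(n.source, ' ', '_')]) YIELD node
RETURN count(node);
```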

I suggest reading "Cypher Tips & Tricks" post made by @michael.hunger
There is a lot of good information there :slight_smile:

Hello Paul,
thank you very much for your input. The reason I am asking about big data is that what some people consider big data is actually a pretty small dataset, e.g. one million entries. For me, big data starts at ten million entries at the least.
Can I constrain on a label (node_type) dynamically? Is there any way to do that from a file column within the MATCH part of the Cypher query? Attached are my labels, which would have to be constrained by a column of the file, if possible.

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///relationship_officeraa' AS row
FIELDTERMINATOR ';'
MATCH (m),(n)
WHERE m.node_id = trim(row.start_id) and n.node_id = trim(row.end_id) and n.source = trim(row.source) and m.source = trim(row.source)
MERGE (m)-[r:OFFICER_OF {source: row.source, status: row.status, start_date: row.start_date, end_date: row.end_date, type: row.links }]-(n)
return  r;

Thank you very much in advance.
Kindest Regards,
Malte

Test this (you need APOC; apoc.cypher.run is used here because labels cannot be parameterized in plain Cypher):

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///import.csv' AS row
FIELDTERMINATOR ';'
	CALL apoc.cypher.run('
	MATCH 
	(m:`'+row.source+'` {node_id: $MNodeID}), 
	(n:`'+row.source+'` {node_id: $NNodeID})
	RETURN m, n'
	, {MNodeID: trim(row.start_id),NNodeID: trim(row.end_id)})	
	YIELD value
WITH value.m as m, value.n as n, row
MERGE (m)-[r:OFFICER_OF {source: row.source, status: row.status, start_date: row.start_date, end_date: row.end_date, type: row.links }]-(n)
return  r;

@MalteR In addition to using row.source as a label, maybe your data allows you to add some labels to the (m) and (n) nodes. If they are different types, that would improve performance even more.

Hi @MalteR,

You've created a good composite index. The problem is that it's not being used in your match. The match needs the label portion to utilize the index.

Try replacing MATCH (m),(n) with MATCH (m:Address), (n:Address)

Also, as a matter of clarity, you can structure it like below (dropping the WHERE):

MATCH (m:Address {node_id:trim(row.start_id), source: trim(row.source)})
MATCH (n:Address {node_id:trim(row.end_id), source: trim(row.source)}) 
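Putting that together with the original LOAD CSV, the whole import might look like the sketch below (assuming both endpoints carry the :Address label; the trailing RETURN r is dropped, since returning rows is not needed during a bulk load):

```cypher
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///relationship_officeraa' AS row
FIELDTERMINATOR ';'
MATCH (m:Address {node_id: trim(row.start_id), source: trim(row.source)})
MATCH (n:Address {node_id: trim(row.end_id), source: trim(row.source)})
MERGE (m)-[r:OFFICER_OF {source: row.source, status: row.status,
  start_date: row.start_date, end_date: row.end_date, type: row.links}]-(n);
```

With the labels in place, both MATCH clauses can use the composite (node_id, source) index instead of scanning all 58 million nodes per row.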

Hope that helps,
-Mike