Dear community, dear Michael,
I have managed to load the nodes with LOAD CSV.
The issue I am struggling with right now is creating the corresponding relationships. Is this something that is done more quickly with Java, due to the internal structure of Neo4j?
Thank you very much.
Kindest Regards,
Malte
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///relationship_officeraa' AS row
FIELDTERMINATOR ';'
MATCH (m), (n)
WHERE m.node_id = trim(row.start_id) AND n.node_id = trim(row.end_id) AND n.source = trim(row.source) AND m.source = trim(row.source)
MERGE (m)-[r:OFFICER_OF {source: row.source, status: row.status, start_date: row.start_date, end_date: row.end_date, type: row.links}]-(n)
RETURN r;
Maybe your import takes ages because your query tries to take too big a bite at once; your periodic commit batch size is currently set to 1000.
The second thing I noticed is that your MATCH / MERGE clauses don't have labels specified. When you use MATCH (m), it reads through the whole graph, which is quite costly.
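For illustration, a version of your lookup with labels might look like the sketch below; Officer and Company are assumptions about your model, not names taken from your data:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///relationship_officeraa' AS row
FIELDTERMINATOR ';'
// Officer/Company are hypothetical labels; substitute your real ones.
MATCH (m:Officer), (n:Company)
WHERE m.node_id = trim(row.start_id) AND n.node_id = trim(row.end_id)
  AND m.source = trim(row.source) AND n.source = trim(row.source)
MERGE (m)-[r:OFFICER_OF {source: row.source, status: row.status, start_date: row.start_date, end_date: row.end_date, type: row.links}]-(n)
RETURN r;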
Hi Paul,
thank you very much for your reply.
Can you give me any advice on how to make this less costly? Regarding labels, I guess you mean the type of node, right? I should be able to specify that. So my scenario would most likely run faster if I specified the node types and then executed the query on the database? Also, what is your background with big data, size-wise?
Kindest Regards,
Malte
First, maybe you can describe your model and data a bit more deeply.
From the relationship name OFFICER_OF, I would suggest adding an Officer label to the (m) nodes.
After reviewing your graph model and adding labels, try running queries like
PROFILE MATCH (m) RETURN m
versus
PROFILE MATCH (m:Officer) RETURN m
You should see the cost difference between the two: the first plan starts with an AllNodesScan over every node in the graph, while the second uses a NodeByLabelScan limited to Officer nodes.
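Once labels are in place, an index (or a uniqueness constraint) on the lookup property turns the label scan into an index seek. A minimal sketch, using Neo4j 3.5 syntax and the hypothetical Officer label:
// Hypothetical label/property; on Neo4j 4.x the syntax is
// CREATE INDEX FOR (n:Officer) ON (n.node_id).
CREATE INDEX ON :Officer(node_id);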
@MalteR
I am using graphs to store identity access history. The current graph is not too big at the moment and I am using Community Edition, but I believe it will grow to a few million nodes by the end of 2020 as I add more data sources.
I had a similar issue while syncing data from different data sources as my graph started growing. After I added additional labels corresponding to each data source, I managed to speed up reads and writes to the graph drastically; some data loading queries that used to take hours now finish in a few minutes. The thing you have to keep in mind is how big a dataset your query is addressing. Thinking in terms of subgraphs helped me improve my model, and it may help you as well.
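Plain Cypher cannot take a label name from a column, so adding per-source labels needs APOC. A minimal sketch of that approach, with the file and column names assumed rather than taken from your data:
// Tag already-loaded nodes with their data source as an extra label.
// apoc.create.addLabels takes a node and a list of label names.
LOAD CSV WITH HEADERS FROM 'file:///nodes.csv' AS row
FIELDTERMINATOR ';'
MATCH (n {node_id: trim(row.node_id), source: row.source})
CALL apoc.create.addLabels(n, [row.source]) YIELD node
RETURN count(node);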
I suggest reading the "Cypher Tips & Tricks" post by @michael.hunger.
There is a lot of good information there.
Hello Paul,
thank you very much for your input. The reason I am asking about big data is that what some people consider big data is actually a pretty small dataset, like one million entries. For me, big data starts at ten million records at the very least.
Can I apply a constraint on a label (node_type) dynamically? Is there any way to do that, loaded from a file, within the MATCH part of the Cypher query? Attached is my query; the labels would have to come from a column of the file, if possible.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///relationship_officeraa' AS row
FIELDTERMINATOR ';'
MATCH (m), (n)
WHERE m.node_id = trim(row.start_id) AND n.node_id = trim(row.end_id) AND n.source = trim(row.source) AND m.source = trim(row.source)
MERGE (m)-[r:OFFICER_OF {source: row.source, status: row.status, start_date: row.start_date, end_date: row.end_date, type: row.links}]-(n)
RETURN r;
Thank you very much in advance.
Kindest Regards,
Malte
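Labels cannot be parameterized in plain Cypher, but apoc.cypher.run lets you build the MATCH with the label taken from row.source at runtime: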
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///import.csv' AS row
FIELDTERMINATOR ';'
CALL apoc.cypher.run('
MATCH
(m:`'+row.source+'` {node_id: $MNodeID}),
(n:`'+row.source+'` {node_id: $NNodeID})
RETURN m, n'
, {MNodeID: trim(row.start_id),NNodeID: trim(row.end_id)})
YIELD value
WITH value.m AS m, value.n AS n, row
MERGE (m)-[r:OFFICER_OF {source: row.source, status: row.status, start_date: row.start_date, end_date: row.end_date, type: row.links}]-(n)
RETURN r;
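For the constraint part of the question: schema commands cannot take parameterized labels either, but apoc.schema.assert can create uniqueness constraints from a computed map. A sketch, assuming the distinct source names are known up front:
// Hypothetical source names; replace with the values of your source column.
WITH ['sourceA', 'sourceB'] AS sources
CALL apoc.schema.assert(
  {},
  apoc.map.fromLists(sources, [s IN sources | ['node_id']]),
  false) // false = keep the existing indexes/constraints
YIELD label, key, unique
RETURN label, key, unique;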
@MalteR In addition to using row.source as a label, maybe your data allows you to add some more specific labels to the (m) and (n) nodes. If those are of different types, it would improve performance even more.