I'm new to Neo4j and working on a data set that contains 2.3M rows. I have got the following questions
When I implement the Load CSV query, I'm not able to load and view the whole dataset.
Do I have to use LIMIT to have an effective visualization?
Query used:
LOAD CSV WITH HEADERS FROM 'file.csv' AS row with toInteger(row.parentid) AS parentid, toInteger(row.childid) AS childid, row.name AS name
RETURN row LIMIT 100;
Working on a CSV file having 3 columns as mentioned above,
I'm trying to create a parent-child relationship between entries in parentid with childid using
In Neo4j Desktop, you can only see the first 1000 rows in table view.
To speed up the load, you must first create UNIQUE CONSTRAINTS on nodes:
CREATE CONSTRAINT child_id IF NOT EXISTS ON (n:childnode) ASSERT n.id IS UNIQUE
CREATE CONSTRAINT parent_id IF NOT EXISTS ON (n:parentnode) ASSERT n.parentid IS UNIQUE
You can use this query after creating the constraints:
USING PERIODIC COMMIT 1000 LOAD CSV WITH HEADERS FROM 'file.csv' AS row
WITH toInteger(row.parentid) AS pid, toInteger(row.childid) AS cid, row.name AS name
MERGE (p:parentnode {parentid: pid, pname: name})-[:PARENT]->(c:childnode {id: cid, cname: name})
Since I was concerned with analyzing the relationships, I didn't notice the time it took to run.
It was working fine for some time and I replied to you,
after that, the query terminated because one of the rows failed the constraint qualification.
I think that I didn't efficiently frame the second part of my question initially.
So I'm trying it once again,
The data I'm dealing with looks similar to the data below
I'm trying to create a relationship between the parent_id and id in which case there can be only one instance of every id.
This will result in a tree structure(node 500 is the parent of id 2 & likewise).
I'm using the query below, also I tried it with LIMIT to check if it matched the requirement. It produced the correct result.
:auto using periodic commit 5000 load csv with headers from 'file.csv' as row
with toInteger(row.parent_id) AS parentid, toInteger(row.id) AS id, row.name AS name
merge (c:childnode {id: id})
on match set c.name = rank
on create set c.name = rank
merge (parent:parentnode {id: parenttaxid})
merge (c)-[:PARENT]->(parent)
return (c)-[:PARENT]->(parent)
I came to know this query(posted 2yrs ago) when I searched for a similar scenario in this community.
So for every 5 minutes, the node label gets updated by 5000. For 2.3 million rows, it will approximately take more than 20 hours
if I'm correct.
Is this normal with Neo4j?
Or Am I committing some silly mistakes?
You put UNIQUE CONSTRAINTS on childnode and parentnode and the query looks optimize so everything good. Now something you can do is to increase the RAM of your database
You can use this tool to get the values of parameters based on the RAM of your machine.
Thanks, the query plan shows three things that are contributing to the slowness.
First, we see that label scans are being used for both parent and c, so you're missing indexes, making these lookups extremely expensive (it's scanning all :childnodes and all :parentnodes for every row, not just once).
A followup question here...what is the difference between a :childnode type and a :parentnode type? If these are supposed to be the same type and the only difference is the relationship they have to one another (especially if a child node can be the parent node of another), then this the wrong way to model your data. If they are of the same type, then you should use the same label for both, and let the relationships you create between them handle the relationship they have to each other, not with the label. If you do change the labels for your data, make sure you have an index or a unique constraint created so your MATCH and MERGE operations on those nodes is fast.
The other thing going on is that there is an Eager operator in the plan, and that's sabotaging the periodic commit behavior. We need to get rid of the Eager here, but first deal with the above issues.
I get the problem now. Yes, you are correct, they are supposed to be the same type. A child node can also be a parent of another. I will use the label childnode for both parent and c.
:auto using periodic commit 5000 load csv with headers from 'file:///node.csv' as row
with toInteger(row.parentid) AS parentid, toInteger(row.id) AS id, row.rank AS rank
merge (c:childnode {id: id})
on match set c.name = rank
on create set c.name = rank
merge (parent:childnode {id: parentid})
merge (c)-[:PARENT]->(parent)
This query now takes approximately 1 minute to add 5000 nodes which is significantly fast than the previous query. I hope this query is as per your suggestion and is this rate ok for the processing of the query?
No, 1 minute for 5000 nodes is far too slow. This should be subsecond.
You didn't mention if you added indexes or not. If you didn't, please do, and then verify with an EXPLAIN of the query.
We also know there is an Eager in the query plan, and that will sabotage the periodic committing. The reason is the two MERGE operations.
We would recommend doing two passes through your CSV first, one to only MERGE c, another pass to only MERGE parent, and a last pass to MATCH (instead of MERGE) both nodes and MERGE the relationship between them. That will remove the Eager.
I've created an index on c.id and the query was able to load 2,317,840 nodes within 2-3 minutes. Then I also followed the passes and eliminated the Eager operator. Finally, 2317840 relationships, got completed after 164007 ms.