Dear all,
I want to merge some data from a CSV file into Neo4j (v3.5.9). Because the labels are defined dynamically in the CSV, APOC is used to achieve this. After importing the entities, I import the relationships as below:
CALL apoc.load.csv($csvFile, {skip:0, header:true}) YIELD map AS relRow
WITH relRow,
     relRow['type'] AS type,
     relRow['from_entity_type'] AS from_type,
     relRow['to_entity_type'] AS to_type
CALL apoc.cypher.doIt(
  'MATCH (source:' + from_type + ' {entity_id: row[\"from_entity_id\"], job_id: row[\"job_id\"]}),
         (target:' + to_type + ' {entity_id: row[\"to_entity_id\"], job_id: row[\"job_id\"]})
   WITH source, target
   MERGE (source)-[r:' + type + ']->(target)
   SET r = $row, r.type = null, r.from_entity_type = null, r.to_entity_type = null,
       r.from_entity_id = null, r.to_entity_id = null,
       r.is_suspected = (CASE WHEN r.is_suspected = \"t\" THEN true ELSE false END),
       r.is_effective = (CASE WHEN r.is_effective = \"t\" THEN true ELSE false END)',
  {row: relRow}) YIELD value AS v
RETURN v
The problem is that it sometimes runs very slowly (more than 2 hours), even for CSV files with fewer than 80,000 entries. However, it runs very fast the second time (less than 1 minute) with the same CSV file and the same script. Of course, indexes have been created for those properties on all entities.
Is there any problem with my script? It has bothered me for a long time. Any suggestions will be appreciated.
Hmm, I'd say the slowness might be because it's doing so many calls to apoc.cypher.doIt, but I'm not sure. Do you have a dummy CSV file that you can share so I can play around with the query?
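If it does turn out to be the per-row calls, one thing to experiment with is grouping the rows by their label/type combination and making a single apoc.cypher.doIt call per group, so the dynamic query is built and invoked once per combination rather than once per row. This is just a rough sketch against the same column names as your script, untested against your data:

CALL apoc.load.csv($csvFile, {skip:0, header:true}) YIELD map AS relRow
// build (and call) the dynamic query once per label/type combination, not once per row
WITH relRow['type'] AS type,
     relRow['from_entity_type'] AS from_type,
     relRow['to_entity_type'] AS to_type,
     collect(relRow) AS rows
CALL apoc.cypher.doIt(
  'UNWIND $rows AS row
   MATCH (source:' + from_type + ' {entity_id: row.from_entity_id, job_id: row.job_id}),
         (target:' + to_type + ' {entity_id: row.to_entity_id, job_id: row.job_id})
   MERGE (source)-[r:' + type + ']->(target)
   SET r = row, r.type = null, r.from_entity_type = null, r.to_entity_type = null,
       r.from_entity_id = null, r.to_entity_id = null,
       r.is_suspected = (CASE WHEN row.is_suspected = "t" THEN true ELSE false END),
       r.is_effective = (CASE WHEN row.is_effective = "t" THEN true ELSE false END)
   RETURN count(*) AS merged',
  {rows: rows}) YIELD value
RETURN type, from_type, to_type, value.merged AS merged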
Because I cannot export data from the production environment, I created several CSV files with only a few records: two files for entities and one for relationships. In production, e_company.csv contains fewer than 100k records, e_email.csv fewer than 1k, and r_use.csv fewer than 80k. The dummy files are listed below;
please import the entities first, then we can check the script I mentioned for the relationship import. Remarkably, the problem is difficult to reproduce even with the same CSV files and script in the production environment.
In addition, there are many calls to apoc.cypher.doIt in other places, but they work fine. It really confuses me.
I created indexes for the specified properties, and all of the data is deleted after a job is done. Could the indexes be affected in some way by the next job? I feel doubtful about this even though the execution plan shows the indexes are used, because I have found the script always works well on a Neo4j instance that already contains some data.
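For reference, the indexes are created with statements like the following before each job (the label names here are just the ones from my dummy files):

// composite indexes on the two properties used to look up each entity
CREATE INDEX ON :Company(entity_id, job_id);
CREATE INDEX ON :Email(entity_id, job_id);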
When you run it and the data is already there, the MERGE statement wouldn't actually be creating a relationship, because it already exists. Or did you mean that it's also faster at creating things that don't exist yet?
I've generated a file to play around with. It is gzipped (rels.csv.gz), but I had to rename it so that I was allowed to upload it:
LOAD CSV WITH HEADERS FROM "file:///rels.csv" AS relRow
WITH relRow,
     relRow['relation_type'] AS type,
     relRow['from_entity_type'] AS from_type,
     relRow['to_entity_type'] AS to_type
CALL apoc.cypher.doIt(
  'MERGE (source:' + from_type + ' {entity_id: row[\"from_entity_id\"], job_id: row[\"job_id\"]})
   MERGE (target:' + to_type + ' {entity_id: row[\"to_entity_id\"], job_id: row[\"job_id\"]})
   WITH source, target
   MERGE (source)-[r:' + type + ']->(target)
   SET r = $row,
       r.relation_type = null, r.from_entity_type = null, r.to_entity_type = null,
       r.from_entity_id = null, r.to_entity_id = null,
       r.is_suspected = (CASE WHEN r.is_suspected = \"t\" THEN true ELSE false END),
       r.is_effective = (CASE WHEN r.is_effective = \"t\" THEN true ELSE false END)',
  {row: relRow})
YIELD value AS v
RETURN v
When I ran it on node labels with no index (by mistake), I found it was taking forever. Could it be that some files have some different, unindexed labels? Or maybe the index isn't online yet?
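On 3.5 you should be able to spot any index that isn't ready with something like this:

// any row returned here is an index that is still populating or has failed
CALL db.indexes() YIELD description, state
WHERE state <> 'ONLINE'
RETURN description, state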
Thank you for your advice. I rechecked the indexes; the status of all of them is online. I did not check whether some files have different unindexed labels; I will check that when I can access the production environment. One question: if it were as you said, the script would also run slowly the second time with the same CSV, but in fact it runs fine.
I found that the rels.csv in the production environment contains about 80k entries and is 40 MB on disk.
You can test whether there are indexes by running this query:
LOAD CSV WITH HEADERS FROM "file:///rels.csv" AS relRow
WITH relRow['from_entity_type'] as from_type,
relRow['to_entity_type'] as to_type
UNWIND [from_type, to_type] AS type
WITH type, count(*) AS count
RETURN type, count, apoc.cypher.runFirstColumn(
'CALL db.indexes()
YIELD labelsOrTypes, properties, state
WHERE $label in labelsOrTypes
RETURN {labels: labelsOrTypes, properties: properties, state: state}',
{label: type}
) AS indexes
I called db.indexes() on my Neo4j 3.5.9; labelsOrTypes is not returned, as the API changed, so the script you provided may not be supported on my Neo4j instance. I captured the details of the indexes by calling db.indexes(); the details are in the attached file, indexes.txt (602 Bytes)
What confuses me is that the value for indexName shows "unnamed index". Will that affect how the indexes work? The problem now is that the same job with the same data works well most of the time, but sometimes it runs for a very long time.
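For what it's worth, I tried adapting your check to the 3.5 output columns (tokenNames in place of labelsOrTypes, if I understand the 3.5 API correctly); I haven't verified it thoroughly:

LOAD CSV WITH HEADERS FROM "file:///rels.csv" AS relRow
WITH relRow['from_entity_type'] AS from_type,
     relRow['to_entity_type'] AS to_type
UNWIND [from_type, to_type] AS type
WITH type, count(*) AS count
// db.indexes() in 3.5 yields tokenNames rather than labelsOrTypes
RETURN type, count, apoc.cypher.runFirstColumn(
  'CALL db.indexes()
   YIELD tokenNames, properties, state
   WHERE $label IN tokenNames
   RETURN {labels: tokenNames, properties: properties, state: state}',
  {label: type}
) AS indexes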
There are no entries like you said in the rels.csv. And I made some rows in rels.csv that aren't mail or company (by mistake), and it seems to work fine. For example, several records like this