LOAD CSV WITH HEADERS FROM 'file:///records.csv' AS freq
MATCH (f:sFreq) WHERE f.slfreq = freq.fr
MATCH (d:Date) WHERE d.date = date(freq.dt)
CREATE (f)-[r:STATISTICS_ON]->(d)
I created the indexes with the following two queries:
CREATE INDEX unique_id_index FOR (n:sFreq) ON (n.fr)
CREATE INDEX unique_id_index1 FOR (n:Date) ON (n.dt)
I have Date and sFreq nodes, and I want to create relationships between them based on matching frequency and date values from a CSV file.
The CSV is huge (4,800,000 rows) and the query has been running for more than 8 hours now.
Is there any way to reduce this time? I have multiple CSV files from which I need to create relationships in the same database.
According to the query, you are searching on the properties 'slfreq' and 'date', not 'fr' and 'dt'; 'fr' and 'dt' are the column headings in your CSV file. As such, I think you want the following indexes:
CREATE INDEX unique_id_index FOR (n:sFreq) ON (n.slfreq)
CREATE INDEX unique_id_index1 FOR (n:Date) ON (n.date)
That may help.
You can see what indexes are being used by looking at the query plan.
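For example (a sketch; prefix any of your queries the same way), prefixing the query with EXPLAIN shows the plan without running it:

```cypher
EXPLAIN
LOAD CSV WITH HEADERS FROM 'file:///records.csv' AS freq
MATCH (f:sFreq) WHERE f.slfreq = freq.fr
RETURN f LIMIT 1
```

If the indexes are being used, you should see a NodeIndexSeek operator in the plan rather than a NodeByLabelScan.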
Actually, I can rewrite the query to group the dates by frequency, so you don't have to search for the same frequency many times. That may help speed things up.
Try this. I collected the dates by frequency in order to do the grouping. Let's see what happens to the memory requirements with the many rows you have:
LOAD CSV WITH HEADERS FROM 'file:///records.csv' AS freq
with freq.fr as freq_fr, collect(distinct freq.dt) as freq_dates
match (f:sFreq{slfreq:freq_fr})
with f, freq_dates
unwind freq_dates as freq_date
match (d:Date{date:date(freq_date)})
create (f)-[r:STATISTICS_ON]->(d)
It's also possible the grouping adds time to the query, negating any benefit. What the query avoids is matching on the same frequency node more than once.
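Separately, with 4.8 million rows a single transaction can strain memory. One common approach (a sketch, assuming Neo4j 4.4 or later, run from Browser with the :auto prefix) is to commit in batches with CALL { ... } IN TRANSACTIONS:

```cypher
:auto LOAD CSV WITH HEADERS FROM 'file:///records.csv' AS freq
CALL {
  WITH freq
  MATCH (f:sFreq) WHERE f.slfreq = freq.fr
  MATCH (d:Date) WHERE d.date = date(freq.dt)
  CREATE (f)-[r:STATISTICS_ON]->(d)
} IN TRANSACTIONS OF 10000 ROWS
```

Committing every 10,000 rows keeps the transaction state small instead of holding all 4.8 million creates in one transaction.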
Can you run the following to ensure there are not duplicate rows?
LOAD CSV WITH HEADERS FROM 'file:///records.csv' as freq
WITH freq.freq as sfreq, freq.new_date as fdate, freq.mean as fmean, count(*) as cnt
WHERE cnt>1
RETURN *
Now the issue is: the CSV has 98169 rows, but the above query created around 2225084 relationships. Ideally it should create one relationship per row. I checked the result with a query, and it shows the same relationship multiple times, when there should be one record. Can you please tell me what is wrong with the above query?
Since you are using 'create' for the relationships, it will create a new one each time. Have you executed this script multiple times in an effort to troubleshoot it? If so, that would explain the duplicates. I agree with you that you should get one new relationship per row in the CSV file.
That will not help, as you are not setting properties on nodes n or m. The syntax of the 'SET' clause is not correct: you don't include the label in the set operations, just the variable and the property.
Are you sure you don't have multiple rows with the same frequency and date combination? You can change the 'create' to 'merge' so the relationship is not created twice. Just a note: merge is not thread safe, so running the batches in parallel could still result in multiple relationships if you have duplicate frequency and date combinations.
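As a sketch, swapping CREATE for MERGE in your original query would look like this; MERGE only creates the relationship if an identical one between the two matched nodes does not already exist:

```cypher
LOAD CSV WITH HEADERS FROM 'file:///records.csv' AS freq
MATCH (f:sFreq) WHERE f.slfreq = freq.fr
MATCH (d:Date) WHERE d.date = date(freq.dt)
MERGE (f)-[r:STATISTICS_ON]->(d)
```

This makes the load idempotent, so re-running the script while troubleshooting will not pile up duplicate relationships.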
Also, using parallel:true when creating relationships can result in lock contention, which may negatively impact performance. That can be true with parallel:false as well.