LOAD CSV WITH HEADERS FROM 'file:///records.csv' AS freq
MATCH (f:sFreq) WHERE f.slfreq = freq.fr
MATCH (d:Date) WHERE d.date = date(freq.dt)
CREATE (f)-[r:STATISTICS_ON]->(d)
I created the indexes with the following two queries:
CREATE INDEX unique_id_index FOR (n:sFreq) ON (n.fr)
CREATE INDEX unique_id_index1 FOR (n:Date) ON (n.dt)
I have Date and sFreq nodes, and I want to create relationships between them based on matching frequency and date values from a CSV file.
The CSV is huge (4,800,000 rows) and the query has been running for more than 8 hours now.
Is there any way to reduce this time? I have multiple CSV files from which I need to create relationships in the same database.
According to the query, you are searching on the properties 'slfreq' and 'date', not 'fr' and 'dt'; 'fr' and 'dt' are the column headings in your CSV file. As such, I think you want the following indexes:
CREATE INDEX unique_id_index FOR (n:sFreq) ON (n.slfreq)
CREATE INDEX unique_id_index1 FOR (n:Date) ON (n.date)
That may help.
You can see what indexes are being used by looking at the query plan.
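For example (a sketch; prefix any of your queries the same way), prefixing the query with EXPLAIN shows the plan without running it:

```cypher
EXPLAIN
LOAD CSV WITH HEADERS FROM 'file:///records.csv' AS freq
MATCH (f:sFreq) WHERE f.slfreq = freq.fr
RETURN f LIMIT 1
```

If the indexes are being used, you should see a NodeIndexSeek operator in the plan rather than a NodeByLabelScan.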
Actually, I can rewrite the query to group the dates by frequency, so you don't have to search for the same frequency many times. That may help speed things up.
Try this. I collected the dates by frequency in order to do the grouping. Let's see what happens to the memory requirements with the many rows you have:
LOAD CSV WITH HEADERS FROM 'file:///records.csv' AS freq
with freq.fr as freq_fr, collect(distinct freq.dt) as freq_dates
match (f:sFreq{slfreq:freq_fr})
with f, freq_dates
unwind freq_dates as freq_date
match (d:Date{date:date(freq_date)})
create (f)-[r:STATISTICS_ON]->(d)
It's also possible the grouping adds time to the query, negating any benefit. What the query avoids is matching on the same frequency node more than once.
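Separately, with 4.8 million rows a single transaction can strain memory. One common approach (a sketch, assuming Neo4j 4.4 or later, run from Browser with the :auto prefix) is to commit in batches with CALL { ... } IN TRANSACTIONS:

```cypher
:auto LOAD CSV WITH HEADERS FROM 'file:///records.csv' AS freq
CALL {
  WITH freq
  MATCH (f:sFreq) WHERE f.slfreq = freq.fr
  MATCH (d:Date) WHERE d.date = date(freq.dt)
  CREATE (f)-[r:STATISTICS_ON]->(d)
} IN TRANSACTIONS OF 10000 ROWS
```

Committing every 10,000 rows keeps the transaction state small instead of holding all 4.8 million creates in one transaction.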
Can you run the following to ensure there are not duplicate rows?
LOAD CSV WITH HEADERS FROM 'file:///records.csv' as freq
WITH freq.freq as sfreq, freq.new_date as fdate, freq.mean as fmean, count(*) as cnt
WHERE cnt>1
RETURN *
Now the issue is: the CSV has 98169 rows, but the above query created around 2225084 relationships. Ideally it should create one relationship per row. I checked the result with a query, and it shows the same relationship multiple times, when there should be one record. Can you please tell me what is wrong with the above query?
Since you are using 'create' for the relationships, it will create a new one each time. Have you executed this script multiple times in an effort to troubleshoot it? If so, that would explain the duplicates. I agree with you that you should get one new relationship per row in the CSV file.
That will not help, as you are not setting properties on nodes n or m. The syntax of the 'SET' clause is not correct: you don't include the label in the set operations, just the variable and the property.
Are you sure you don't have multiple rows with the same frequency and date combination? You can change the 'create' to 'merge' so the relationship is not created twice. Just a note: merge is not thread safe, so running the batches in parallel could still result in multiple relationships if you have duplicate frequency and date combinations.
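As a sketch, swapping CREATE for MERGE in your original query would look like this; MERGE only creates the relationship if an identical one between the two matched nodes does not already exist:

```cypher
LOAD CSV WITH HEADERS FROM 'file:///records.csv' AS freq
MATCH (f:sFreq) WHERE f.slfreq = freq.fr
MATCH (d:Date) WHERE d.date = date(freq.dt)
MERGE (f)-[r:STATISTICS_ON]->(d)
```

This makes the load idempotent, so re-running the script while troubleshooting will not pile up duplicate relationships.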
Also, using parallel:true when creating relationships can result in lock contention, which may negatively impact performance. That can be true with parallel:false as well.