Best way to load large CSV files and link the created nodes into a chain of events

DMF194 · December 7, 2021, 7:41am

Hi all,

It is me again ,
I have some trouble loading a dataset that has over 10 million rows.

For now this is how my query looks like,

:auto USING PERIODIC COMMIT 5500
//Load the has_positions table
LOAD CSV WITH HEADERS
FROM 'file:///has_positions.csv'
as row 
merge (p:POSITIONS {id: toInteger(row.device_id)})
ON CREATE SET 
p.position_id=row.position_id,
p.device_id=row.device_id,
p.latitude=row.latitude,
p.longitude=row.longitude,
p.time=row.time

but it is taking more than 2 hours, I have added the following constraint,

CONSTRAINT ON (positions:POSITIONS) ASSERT (positions.device_id) IS UNIQUE

My desired outcome would be to have all my POSITIONS nodes in an interlinked chain of events sequenced by time. But I have no idea how to accomplish it

Any suggestions would be helpful

akollegger1 · December 7, 2021, 2:29pm

Hi @DMF194 ,

MERGE, SET, and CONSTRAINT are using inconsistent property names and values for device_id.

In the MERGE you are using the property name id with the value of row.device_id converted to an integer. Probably this should be changed to MERGE (p:POSITIONS {device_id: toInteger(row.device_id)}).

In the SET you are setting p.device_id to row.device_id as a string (without conversion to an integer). Probably this should be changed to p.device_id = toInteger(row.device_id)

In the CONSTRAINT you are using device_id. Probably that is correct.

Try that.

Best,
ABK

DMF194 · December 8, 2021, 1:58am

Hi @abk ,

Solved, I was definitely not consistent with my naming of property labels.

It took only 8 minutes to run for 14 million rows

In addition, is there any general examples of how to connect nodes in sequence, ordered by datetime ?

Any suggestions, would be appreciated

dana_canzano · December 8, 2021, 2:45am

@DMF194

regarding connecting nodes in sequence ordered by datetime, I dont see where there is a date time value in your MERGE.
But also I have had the same concern for example loading similar to these posts here at community.neo4j.com into Neo4j where a single post might have N number of replies and wanting to build out a sequence for example

(n:Post {id1}) --> (n:Reply {id:1}) --> (n:Reply {id:2}) --> (n:Reply {id:3}) --> (n:Reply {id:4})

but choose not to build out a N relationship path but rather simply had

(n:Post {id:1}) --> (n:Reply {id:1})
(n:Post {id:1}) --> (n:Reply {id:2})
(n:Post {id:1}) --> (n:Reply {id:3})
(n:Post {id:1}) --> (n:Reply {id:4})

and then ran cypher and used match (n:Post {id:1})-[]-(n2:Reply) order by n2.datetime so as to get the order

glilienfield · December 9, 2021, 4:52pm

Substitute your date attribute for the n.order attribute to enforce the sort order you want and change the relationship type to yours. This will chain them by earliest to latest dates. Add DESC to the order by clause if you want them from latest to earliest. Also, add a where clause to the leading match clause if you don't want all the nodes chained together.

match (n)
with n
order by n.order
with collect(n) as list
with list, range(0, size(list)-2) as indexes
unwind indexes as i
match (x) where id(x)=id(list[i])
match (y) where id(y)=id(list[i+1])
merge (x)-[:RELATIONSHIP_TYPE]->(y)

mengjiakang2017 · December 9, 2021, 5:50pm

@DMF194 I found below query loads fast for me - posting here in case helpful -

call apoc.periodic.iterate('call apoc.load.csv("file:///has_positions.csv") 
yield map as row
return row
','
MERGE, SET queries', {batchSize:10000, iterateList:True, parallel:True})

some other documentation is here - https://neo4j.com/labs/apoc/4.2/overview/apoc.periodic/apoc.periodic.iterate/

DMF194 · December 13, 2021, 8:00am

@mengjiakang2017 Thanks, appreciate the additional resources for importing large CSV files

DMF194 · December 17, 2021, 2:43am

Hi @mengjiakang2017,

I just tried out your apoc.periodic .iterate but I get the following error

There is no procedure with the name apoc.periodic.iterate registered for this database instance. Please ensure you've spelled the procedure name correctly and that the procedure is properly deployed.

I am unable to search for resources on this not even the docs in Neo4j clearly explain

akollegger1 · December 17, 2021, 3:48pm

Hi @DMF194 ,

APOC is a commonly used Neo4j plugin which provides many super useful functions and procedures. It must be installed and enabled on the Neo4j DBMS.

For details, see Awesome Procedures On Cypher (APOC) - Neo4j Labs

Best,
ABK

DMF194 · December 22, 2021, 6:15am

@abk

Thanks for the reply, the issue was due to the upgrading process from graph database from 4.3.5 to higher version.

I have uninstalled my neo4j desktop and reinstalled

Topic		Replies	Views
Creating relationship using MERGE runs endlessly Cypher	12	688	May 12, 2021
Slow load_csv Cypher cypher	4	2074	July 29, 2019
Incredibly Long Load CSV Using Desktop Import / Export	1	221	March 20, 2023
LOAD CSV taking time Import / Export cypher , import	6	755	September 25, 2021
Fastest way to load data in neo4j using python Cypher	5	10574	May 5, 2021

Take the Course Then Join The Aura Agent Hackathon

Best way to load large CSV files and link the created nodes into a chain of events

Related topics

Take the Course Then Join
The Aura Agent Hackathon