cancel
Showing results for 
Search instead for 
Did you mean: 

Best way to load large CSV files and link the created nodes into a chain of events

DMF194
Node Clone

Hi all,

It is me again ,
I have some trouble loading a dataset that has over 10 million rows.

For now this is how my query looks like,

:auto USING PERIODIC COMMIT 5500
//Load the has_positions table
LOAD CSV WITH HEADERS
FROM 'file:///has_positions.csv'
as row 
merge (p:POSITIONS {id: toInteger(row.device_id)})
ON CREATE SET 
p.position_id=row.position_id,
p.device_id=row.device_id,
p.latitude=row.latitude,
p.longitude=row.longitude,
p.time=row.time

but it is taking more than 2 hours, I have added the following constraint,

CONSTRAINT ON (positions:POSITIONS) ASSERT (positions.device_id) IS UNIQUE

My desired outcome would be to have all my POSITIONS nodes in an interlinked chain of events sequenced by time. But I have no idea how to accomplish it

Any suggestions would be helpful

1 ACCEPTED SOLUTION

abk
Neo4j
Neo4j

Hi @DMF194 ,

MERGE, SET, and CONSTRAINT are using inconsistent property names and values for device_id.

In the MERGE you are using the property name id with the value of row.device_id converted to an integer. Probably this should be changed to MERGE (p:POSITIONS {device_id: toInteger(row.device_id)}).

In the SET you are setting p.device_id to row.device_id as a string (without conversion to an integer). Probably this should be changed to p.device_id = toInteger(row.device_id)

In the CONSTRAINT you are using device_id. Probably that is correct.

Try that.

Best,
ABK

View solution in original post

9 REPLIES 9

abk
Neo4j
Neo4j

Hi @DMF194 ,

MERGE, SET, and CONSTRAINT are using inconsistent property names and values for device_id.

In the MERGE you are using the property name id with the value of row.device_id converted to an integer. Probably this should be changed to MERGE (p:POSITIONS {device_id: toInteger(row.device_id)}).

In the SET you are setting p.device_id to row.device_id as a string (without conversion to an integer). Probably this should be changed to p.device_id = toInteger(row.device_id)

In the CONSTRAINT you are using device_id. Probably that is correct.

Try that.

Best,
ABK

Hi @abk ,

Solved, I was definitely not consistent with my naming of property labels.

It took only 8 minutes to run for 14 million rows

In addition, is there any general examples of how to connect nodes in sequence, ordered by datetime ?

Any suggestions, would be appreciated

@DMF194

regarding connecting nodes in sequence ordered by datetime, I dont see where there is a date time value in your MERGE.
But also I have had the same concern for example loading similar to these posts here at community.neo4j.com into Neo4j where a single post might have N number of replies and wanting to build out a sequence for example

(n:Post {id1}) --> (n:Reply {id:1}) --> (n:Reply {id:2}) --> (n:Reply {id:3}) --> (n:Reply {id:4})

but choose not to build out a N relationship path but rather simply had

(n:Post {id:1}) --> (n:Reply {id:1})
(n:Post {id:1}) --> (n:Reply {id:2})
(n:Post {id:1}) --> (n:Reply {id:3})
(n:Post {id:1}) --> (n:Reply {id:4})

and then ran cypher and used match (n:Post {id:1})-[]-(n2:Reply) order by n2.datetime so as to get the order

glilienfield
Ninja
Ninja

Substitute your date attribute for the n.order attribute to enforce the sort order you want and change the relationship type to yours. This will chain them by earliest to latest dates. Add DESC to the order by clause if you want them from latest to earliest. Also, add a where clause to the leading match clause if you don't want all the nodes chained together.

match (n)
with n
order by n.order
with collect(n) as list
with list, range(0, size(list)-2) as indexes
unwind indexes as i
match (x) where id(x)=id(list[i])
match (y) where id(y)=id(list[i+1])
merge (x)-[:RELATIONSHIP_TYPE]->(y)

mengjiakang2017
Node Clone

@DMF194 I found below query loads fast for me - posting here in case helpful -

call apoc.periodic.iterate('call apoc.load.csv("file:///has_positions.csv") 
yield map as row
return row
','
MERGE, SET queries', {batchSize:10000, iterateList:True, parallel:True})

some other documentation is here - apoc.periodic.iterate - APOC Documentation

@mengjiakang2017 Thanks, appreciate the additional resources for importing large CSV files

Hi @mengjiakang2017,

I just tried out your apoc.periodic .iterate but I get the following error

There is no procedure with the name apoc.periodic.iterate registered for this database instance. Please ensure you've spelled the procedure name correctly and that the procedure is properly deployed.

I am unable to search for resources on this not even the docs in Neo4j clearly explain

Hi @DMF194 ,

APOC is a commonly used Neo4j plugin which provides many super useful functions and procedures. It must be installed and enabled on the Neo4j DBMS.

For details, see Awesome Procedures On Cypher (APOC) - Neo4j Labs

Best,
ABK

@abk

Thanks for the reply, the issue was due to the upgrading process from graph database from 4.3.5 to higher version.

I have uninstalled my neo4j desktop and reinstalled