I'm sure there is a more efficient way of doing what I'm doing. In short, I have a PHP script that loads a CSV into my Neo4j database. I'm pretty sure it can be done with LOAD CSV, but I come from a PHP/MySQL background and still feel like a fish out of water with graph databases, so I just wanted to use what I know... however, it's taken 3 days so far to import a 2,600-row CSV, albeit with 365 columns per row.
So here is the scenario.
This CSV is the result of 365 days of logs for 2,600 sensors. It looks something like this:
sensorID, 2020-01-01 00:00, 2020-01-02 00:00, ... through to 2020-12-31 00:00
123-01,10,11,10,...
123-02,3,6,8,...
...and so on for 2,600 sensors.
My PHP creates the graph like this:
It checks to see if the sensor node exists; if it doesn't, it creates that node, creates a data node, and relates the two.
It then goes on to create a node for each day with that day's value and relates it to the data node.
So you end up with a chain like (Sensor)-->(Data)-->(Day), with one Day node per date column.
I hope I've made that clear, but I'll try it a different way just in case.
I have a Sensor node that contains the sensor ID.
Related to that, I have another node (just there to attach the data to) that I relate all the data to, one node per day. These nodes and relationships need to be created using the header info, as I don't know what period a file will cover until the CSV gets uploaded; that's also why I need to check whether the sensor already exists and only add the new data.
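To be concrete, each day's value in my loop ends up running Cypher roughly like this (the labels, relationship types, and $ parameters here are just illustrative; my real names differ slightly):

MERGE (s:Sensor {id: $sensorId})            // create the sensor only if it doesn't already exist
MERGE (s)-[:HAS_FEED]->(f:DataFeed)         // the middle node I hang the data off
MERGE (f)-[:ON_DAY]->(d:Day {date: $date})  // one node per day, taken from the CSV header
SET d.value = $value                        // that day's reading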
I am pretty sure I should be using LOAD CSV, but I just can't work it out, and I'm kind of feeling I should have stuck with MySQL. I just know that, going forward, this is the right way.
Thank you for your time to read this.
Can you share the code that you've written to do all this, and maybe a dummy version of the CSV files that contains a few rows... and then I'll try to write a LOAD CSV version of the query?
Kind of… there's a node in between the SENSOR node and the data nodes. There will be other data sources for each of the sensor nodes, so I wanted to isolate them a bit… hang on.
I think the query below might be close to what you want:
LOAD CSV WITH HEADERS FROM "file:///sensor_data_2020_trunk.txt" AS row
// every column other than SENSOR is a date column; skip empty cells
WITH row, row.SENSOR AS sensor,
     [key IN keys(row) WHERE key <> 'SENSOR' AND row[key] IS NOT NULL] AS keys
MERGE (s:Sensor {id: sensor})
WITH s, row, keys
CALL {
  WITH s, row, keys
  UNWIND keys AS key
  // one NbFeed node per date, with the reading stored on the relationship
  MERGE (n:NbFeed {id: key})
  MERGE (s)-[feed:HAS_NBFEED]->(n)
  SET feed.value = row[key]
  RETURN count(*) AS feedCount
}
RETURN count(*)
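If you do want a separate node per day hanging off a middle node, as you described, something along these lines might be closer (I'm guessing at the label names DataFeed and Day, since I don't know what you called them):

LOAD CSV WITH HEADERS FROM "file:///sensor_data_2020_trunk.txt" AS row
WITH row, row.SENSOR AS sensor,
     [key IN keys(row) WHERE key <> 'SENSOR' AND row[key] IS NOT NULL] AS days
MERGE (s:Sensor {id: sensor})
MERGE (s)-[:HAS_FEED]->(f:DataFeed)
WITH f, row, days
UNWIND days AS day
// one Day node per date column, per sensor; re-running only adds missing days
MERGE (f)-[:ON_DAY]->(d:Day {date: day})
SET d.value = row[day]

And if the files get big, it's worth batching the import (USING PERIODIC COMMIT on older Neo4j versions, or CALL { … } IN TRANSACTIONS on 4.4+) so it doesn't all run in one transaction.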
I'm marking your solution, Mark. It's not quite there, but I think I can work it out from what you've done. Apologies, I realise that I changed the name of the middle node, which has confused matters.
The biggest plus is that it imports this CSV in minutes, not days.
You can speed that up by indexing nodes. If a node type has a property that maps to a primary key, set a constraint on that property using Cypher, and it gets faster.
Thanks, Brandon, care to explain more? I know what indexing is, I use it all the time, but (and I'm probably naive here) isn't the point of a graph DB that you don't need to index? I didn't know it was an option; I've been thrown in at the deep end trying to figure this out whilst putting a project together. In the example above, the sensor IDs are what I would put a primary key on if it were, say, MySQL or MS Access.
Appreciate the input. Thank you.
Having indexes allows Neo4j to quickly look up the array index of specific nodes in memory.
In Neo4j, setting a node property constraint will automatically create an index for that property in the DB for you. Without one, every query for one specific node (be it a search or an import's MERGE lookup) will scan every node to check how many match your query. If you have 750 nodes, every record costs at least 750 DB hits when you don't have a constraint. Set a constraint and that's down to 1-3.
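For the Sensor nodes above, that would be something like this (this is the newer 4.4+ syntax; on older versions it's CREATE CONSTRAINT ON (s:Sensor) ASSERT s.id IS UNIQUE):

// enforce one Sensor node per id, and get a backing index for free
CREATE CONSTRAINT sensor_id IF NOT EXISTS
FOR (s:Sensor) REQUIRE s.id IS UNIQUE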
You can verify this with the EXPLAIN and PROFILE tools in the Neo4j Browser.
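For example, run the same lookup with PROFILE before and after creating the constraint and compare the db hits:

PROFILE MATCH (s:Sensor {id: '123-01'}) RETURN s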
From my experience, you should be able to import those nodes in a matter of seconds.