How do I use LOAD CSV for this scenario?

Hello,

I'm sure there is a more efficient way of doing what I'm doing. In short, I have a PHP script that loads a CSV into my Neo4j database. I'm fairly sure this can be done with LOAD CSV, but I come from a PHP / MySQL background and still feel like a fish out of water with graph databases, so I just wanted to use what I know... however, it has taken 3 days so far to import a 2,600-row CSV, albeit one with 365 columns per row.
So here is the scenario.
This CSV is the result of 365 days of logs for 2,600 sensors. The CSV looks something like this:
sensorID, 2020-01-01 00:00:00, 2020-01-02 00:00, ... through to 2020-12-31 00:00
123-01,10,11,10,...
123-02,3,6,8,... (and so on, for 2,600 sensors)
My PHP creates the graph like this

It checks whether the sensor node already exists; if it doesn't, it creates that node, creates a data node, and relates the two.
It then goes on to create a node for each day's value and relates that to the data node.
So you end up with:

[Sensor_node] - HAS_DATA -> [Sensor_data]
[Sensor_data]- DATA_VALUES_SENSOR_2020-01-01->[Data]

I hope I've made that clear, but I'll try it a different way just in case.
I have a Sensor node that contains the sensor ID.
Related to that I have another node (just something to attach the data to), and I relate all the data to it, one node per day. These nodes and relationships need to be created from the header info, because I don't know what period a CSV file will cover until it gets uploaded; that's also why I need to check whether the sensor already exists and only add the new data.
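To put that in Cypher terms, the shape I'm after per sensor and per day is roughly the sketch below (illustrative only; the labels follow my description above, the id/date/value are made up, and the date sits on a relationship property rather than in the relationship type):

// One sensor, its data hub, and one day's value (dates come from the CSV header)
MERGE (s:Sensor {id: '123-01'})
MERGE (s)-[:HAS_DATA]->(hub:Sensor_data {sensor: '123-01'})
CREATE (d:Data {date: '2020-01-01', value: 10})
CREATE (hub)-[:DATA_VALUES {date: '2020-01-01'}]->(d)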
I am pretty sure I should be using LOAD CSV, but I just can't work it out, and I'm starting to feel I should have stuck with MySQL. I just know that, going forward, this is the right way.
Thank you for your time to read this.

Can you share the code that you've written to do all this, and maybe a dummy version of the CSV file containing a few rows... and then I'll try to write a LOAD CSV version of the query?


Certainly, this is my PHP:

<?php
ini_set('display_errors', 1);
ini_set('display_startup_errors', 1);
error_reporting(E_ALL);
require "src/autoload.php";
require "../../db/neo4j.inc";
$countline = 0;
$node = "";
$relation = "";
$file = "tobeimported/sensor_data_2020_trunk.csv";

$node = "xxx-xx";
$cypher = "MATCH (n:SENSOR{id:'".$node."'}) RETURN n LIMIT 25";
$res = $bolt->run($cypher);
$records = $bolt->pull();
print_r($records); echo "<br>";


$file_open = fopen($file,"r");
while(($csv = fgetcsv($file_open, 0, ",")) !== false ) 
	{
		if ($countline == 0) //First Line has the headers
		{
			$headers = $csv;
			//print_r($headers);
		}
		else
		{
			$node = $csv[0];
			$cypher = "MATCH (n:SENSOR{id:'".$node."'}) RETURN n LIMIT 25"; //Has it already got a node?
			//$cypher = "MATCH (n) RETURN n LIMIT 25";
			$res = $bolt->run($cypher);
			$records = $bolt->pull();
			//print_r($records)."<br>";
			if (count($records)==1)
			{
				$cypher = "CREATE (:SENSOR{id:'".$node."'})"; //CREATE the node
				$res = $bolt->run($cypher);
				$records = $bolt->pull(); 
				//We didn't have a SENSOR node, so we won't have an NBFEED node either; CREATE that node too
				$cypher = "CREATE (:NBFEED{id:'".$node."'})";
				$res = $bolt->run($cypher);
				$records = $bolt->pull();
				$cypher = "MATCH (d:SENSOR{id:'".$node."'}), (n:NBFEED{id:'".$node."'}) MERGE(d)-[h:HAS_NBFEED]->(n) RETURN d.id, type(h), n.id"; 
				//Relate the two nodes
				//echo $cypher;
				$res = $bolt->run($cypher);
				$records = $bolt->pull();
				//print_r($records)."<br>";
				
			}
			for ($i = 1; $i < count($headers); $i++) //Loop over the date columns
			{
				$cypher = "CREATE (:NBFEED_DATA{id:'".$headers[$i]."',data:'".$csv[$i]."',SENSOR:'".$node."'})";
				$res = $bolt->run($cypher);
				$records = $bolt->pull();
				$cypher = "MATCH (n:NBFEED{id:'".$node."'}), (d:NBFEED_DATA{id:'".$headers[$i]."', SENSOR:'".$node."'}) MERGE(n)-[h:HAS_NBFEED_DATA{id:'".$headers[$i]."'}]->(d) RETURN d.id, type(h), n.id"; 
				//Relate the two nodes
				echo $cypher;
				$res = $bolt->run($cypher);
				
				
			}


		}
		$countline++;
	}

fclose ($file_open);
?>

sensor_data_2020_trunk.txt (12.5 KB)

The extension will need to be renamed to .csv.

Thank you, I'm sure once I have this I'll be able to get my head around the rest.

Thank you Mark
I have done that.

So just to check that I know what you want to do. The idea is:

  • Create one node per sensor (which means one sensor node per row)
  • Create one feed node per column and connect those feed nodes to the sensor nodes.

Is that right?

Kind of…. There’s a node in between the SENSOR node and the data nodes… There will be other data sources for each of the sensor nodes so I wanted to isolate them a bit… hang on

Does this help:

image001.png

Thanks again

And what is NBFEED?

I think the query below might be close to what you want:

LOAD CSV WITH HEADERS FROM "file:///sensor_data_2020_trunk.txt" AS row
WITH row, row.SENSOR AS sensor, [key IN keys(row) WHERE key <> 'SENSOR' AND row[key] IS NOT NULL] AS keys
MERGE (s:Sensor {id: sensor})
WITH s, row, keys
CALL {
 WITH s, row, keys
 UNWIND keys AS key
 MERGE (n:NbFeed {id: key})
 MERGE (s)-[feed:HAS_NBFEED]->(n)
 SET feed.value = row[key]
 RETURN count(*)
}
RETURN count(*)
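If it's useful as a spot check after the import, something like this (assuming the labels above and one of the sensor ids from your dummy file, e.g. '123-01') should show a few of the imported values:

MATCH (s:Sensor {id: '123-01'})-[feed:HAS_NBFEED]->(n:NbFeed)
RETURN n.id AS date, feed.value AS value
ORDER BY date
LIMIT 5

One thing to keep in mind: LOAD CSV reads every value as a string, so wrap row[key] in toInteger(...) or toFloat(...) in the SET if you want the values stored as numbers.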

I'm marking your solution, Mark. It's not quite there, but I think I can work it out from what you have done. I realise that I changed the name of the middle node, which has confused matters.
The biggest plus is that it imports this CSV in minutes, not days.

Thank you


You can speed that up by indexing nodes. If a node label has a property that acts as its primary key, set a constraint on that property using Cypher and it gets faster.
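For example, something along these lines (a sketch assuming the Sensor label and id property from Mark's query; on Neo4j versions before 4.4 the syntax is CREATE CONSTRAINT ON (s:Sensor) ASSERT s.id IS UNIQUE instead):

// Uniqueness constraint on Sensor.id; this also creates the backing index
CREATE CONSTRAINT sensor_id IF NOT EXISTS
FOR (s:Sensor) REQUIRE s.id IS UNIQUE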


Thanks, Brandon, care to explain more? I know what indexing is, I use it all the time, but (and I'm probably naive here) isn't the point of a graph DB that you don't need to index? I didn't know it was an option; I've been thrown in at the deep end and am trying to figure this out whilst putting a project together. In the example above, the sensor ids are what I would put a primary key on if it were, say, MySQL or MS Access.
Appreciate the input :slight_smile: Thank you

Sure!

Having indexes allows Neo4j to look up specific nodes directly instead of scanning through them all.

In Neo4j, setting a node property constraint automatically creates an index on that property for you. Without one, every query for a single specific node (be it a plain search or the lookup behind an import MERGE) has to scan every node with that label to see how many match. If you have 750 nodes, every record imported costs at least 750 DB hits when you don't have a constraint; set a constraint and that's down to 1-3.

You can verify this with the EXPLAIN and PROFILE tools in the Neo4j Browser.
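For example (using the Sensor label and a sample id from this thread), prefix the lookup with PROFILE and run it before and after adding the constraint; without it the plan shows a full label scan, with it an index seek:

PROFILE
MATCH (s:Sensor {id: '123-01'})
RETURN s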

From my experience, you should be able to import those nodes in a matter of seconds.