Create 1 node for use during CSV loading


(Massung) #1

Simply put, I have a Spark job that's run, the output of which is a CSV. Currently, I load this CSV into Neo4j without problems. But, I'd like to do one more step: add a single node representing information about the job run to the graph that will link to every row of the CSV (the output).

Currently, my Cypher query looks like so:

using periodic commit
load csv with headers from $url as line

merge (n:Output {name:})
on create set ...
on match set ...

What I'd like to do is add the one new node so I can additionally create a relationship. For example (which fails, obviously):

create (job:Job {
  time: timestamp(),
  source: $hdfsLocation
})

using periodic commit
load csv ...

merge (job)-[:PRODUCED]->(n)

I've tried variations of the above to no avail. Maybe I'm just missing a comma or something?

In case it comes up as a possible solution: I don't (currently) have anything that uniquely identifies the "job". The HDFS location, for example, is used many times over with different arguments, so I don't want to overwrite an existing job that uses the same source script. I could potentially run 2 queries (first create the job node, then load the CSV), but I'm unsure how to get the (non-unique) job node from the first query into the second.

Thanks in advance!

(Benoit Simard) #2


The best approach here is to use two statements:

  • Create the Job node and retrieve the node's ID: create (job:Job { time: timestamp(), source: $hdfsLocation }) RETURN id(job) AS id

  • Then load your CSV file, passing that ID in as the $id parameter:

  USING PERIODIC COMMIT
  LOAD CSV WITH HEADERS FROM $url AS line
  MATCH (job) WHERE id(job) = $id
  MERGE (n:Output {name: })
  MERGE (job)-[:PRODUCED]->(n)

(Massung) #3

This was my first follow-up idea as well, but looking online, it's highly suggested by the Neo4j team to avoid using ID (which surprises me, given that the function is exposed).

Thanks for the idea and example. I'll likely end up doing that if no other solution presents itself.

(Benoit Simard) #4

You must not use the technical ID of nodes as a business key, because Neo4j reuses IDs.

When a node is created, it receives an ID, and that ID stays the same for the node's whole life.
But if you delete node 44 and create a new node right afterwards, the new node can obtain the ID 44.
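
To make the reuse concrete, here is a rough sketch of the sequence I mean. The `Temp` label is just a throwaway name for the illustration, and whether the second ID actually equals the first depends on your Neo4j version and store state, so treat it as "can happen", not "will happen":

```cypher
// Run each statement separately.

// 1. Create a scratch node and note the internal ID it is given.
CREATE (a:Temp) RETURN id(a) AS firstId;

// 2. Delete it, which frees that internal ID.
MATCH (a:Temp) DELETE a;

// 3. Create another node; Neo4j is allowed to hand out the freed ID again,
//    so secondId may be the same value as firstId.
CREATE (b:Temp) RETURN id(b) AS secondId;
```

Anything outside the database that stored `firstId` would now silently point at a different node, which is why the ID is a poor long-term key.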

So inside a transaction (or in your use case), you can use the node's ID.
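
If you would still rather not rely on the internal ID, one alternative is to give the job node its own generated key and match on that instead. This is only a sketch under a few assumptions: `uuid` is a property name I made up, `line.name` stands in for whatever CSV column holds your output name, and `randomUUID()` needs Neo4j 3.4 or later:

```cypher
// Statement 1: create the job with a generated key and return it.
// `uuid` is a hypothetical property name, not part of your model.
CREATE (job:Job {
  time: timestamp(),
  source: $hdfsLocation,
  uuid: randomUUID()
})
RETURN job.uuid AS jobUuid;

// Statement 2: pass the value returned above back in as $jobUuid.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM $url AS line
MATCH (job:Job {uuid: $jobUuid})
// `line.name` is an assumed column; use your real key here.
MERGE (n:Output {name: line.name})
MERGE (job)-[:PRODUCED]->(n);
```

Since every run gets a fresh value, two jobs with the same HDFS source never collide, and the key stays valid even if nodes are deleted later. If you go this way, an index on `:Job(uuid)` would probably be worth adding so the MATCH stays cheap on every row.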