CSV import hanging server


(Massung) #1

I have a rather large cypher query to import a CSV, the current version that exists works fine, but I'm trying to add one bit to it which seems to be blowing up the server.

Here's the current (working) version:

using periodic commit load csv with headers from 'https://s3..csv' as r

// lookup the analysis node to connect to later on
match (q:Analysis) where ID(q)=$id

// create the variant node if it doesn't exist
merge (v:Variant {name: r.variant_id})
on create set v.field = ...

// create the consequence node
create (n:TranscriptConsequence {
  // ~120 fields being initialized from the CSV here

// create connections
merge (n)-[:PRODUCED_BY]->(q)
merge (n)-[:FOR_VARIANT]->(v)

Now, what I'm attempting to do is take one of the fields in the CSV (a gene_id) and - if it exist - create a connection to a Gene node already in the database. So, I've modified the query like so:

match (q:Analysis) where ID(q)=$id

// lookup the optional gene for this consequence
optional match (g:Gene) where r.gene_id is not null and g.name = r.gene_id

// ... create nodes + relationships ...

// optionally create a relationship to the gene
with n, case g when null then [] else [g] end as gs
foreach (g in gs | merge (n)-[:FOR_GENE]->(g))

Adding this bit of code ends up hanging the server. The query never inserts anything into the database, there are no cypher errors/warnings, and all cypher queries done from then on never return (or timeout for that matter). It's as if the server has locked the DB and every future query is waiting for a global transaction to finish. I have to kill -9 the server and restart it.

The last log entries that have any new information are in debug.log:

2019-01-05 02:54:46.490+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 321297 checkpoint started...
2019-01-05 02:54:48.002+0000 INFO [o.n.k.i.s.c.CountsTracker] Rotated counts store at transaction 321297 to [/home/ec2-user/efs/neo4j/data/databases/graph.db/neostore.counts.db.b], from [/home/ec2-user/efs/neo4j/data/databases/graph.db/neostore.counts.db.a].
2019-01-05 02:57:49.822+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by scheduler for time threshold @ txId: 321297 checkpoint completed in 3m 3s 326ms
2019-01-05 02:57:50.866+0000 INFO [o.n.k.i.t.l.p.LogPruningImpl] No log version pruned, last checkpoint was made in version 1070

I'm hoping someone here can point out to me the error of my ways or tell me a better way to do what I'd like?

I'm using neo4j version 3.4.4.


(Stefan Armbruster) #2

I guess you're suffering the well known eager pipe issue. There are a few blog posts out there on this.
Bottom line: have multiple "load csv" commands where each one does only a smaller part of the work to prevent eager pipe.

(Massung) #3

Thanks for the reply. I can give that a try.

Out of curiosity, is my method for optionally fetching (and connecting to) the other node the best way of doing that?

(Stefan Armbruster) #4

Instead of the case ... foreach you could simply do a where g <> null and do an unconditional merge afterwards.

(Massung) #5

Sure, that's fine (now) given that it's in a separate query where such a filter is OK because there's nothing else happening other than that single merge (note: this did fix my issue and thank you!).

But, let's say the original query didn't suffer from the "eager pipeline" issue. Would the optional match and foreach be the way to go still in that case still, or is there some other way that would be more readable?

For example, I imagine just doing:

load csv ... as r

optional match (a:A) where a.name = r.aName
create (b:B { ... })

// other merge operations here...

merge (b)-[:HAS]->(a)

I assumed that would break if a was null, hence why I chose the foreach.