I am running a pipeline from S3 through Spark and finally into Neo4j, using the Neo4j Bolt driver for Python to write the data. I have 1,600 files on S3; I use UNWIND to batch the writes, and I also created a uniqueness constraint on name. The database is now 31 GB and I am running Neo4j 3.5.14.
'''CREATE CONSTRAINT ON (p:PATENT) ASSERT p.name IS UNIQUE'''
'''WITH $names AS nested
UNWIND nested AS x
MERGE (w:PATENT {name: x[0]})
MERGE (n:PATENT {name: x[1]})
MERGE (w)-[r:CITE]-(n)
'''
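For context, the per-file writes go through the Python driver roughly like this. This is only a minimal sketch: the URI, credentials, helper names, and the batch size of 10,000 are placeholders, not my exact code.
'''
from neo4j import GraphDatabase

# Placeholder connection details, not my real ones.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CITE_QUERY = """
WITH $names AS nested
UNWIND nested AS x
MERGE (w:PATENT {name: x[0]})
MERGE (n:PATENT {name: x[1]})
MERGE (w)-[r:CITE]-(n)
"""

def _merge_cites(tx, batch):
    # One UNWIND statement per batch of (citing, cited) name pairs.
    tx.run(CITE_QUERY, names=batch)

def write_pairs(pairs, batch_size=10000):
    # One managed transaction per batch; the driver retries transient failures.
    with driver.session() as session:
        for i in range(0, len(pairs), batch_size):
            session.write_transaction(_merge_cites, pairs[i:i + batch_size])
'''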
My problem is that at first the writes were fast, around 30 seconds per file. Then, starting this afternoon, they slowed down to 4-5 minutes per file. I checked Neo4j's debug.log, and it looks like garbage collection is going on; see below:
2020-02-09 07:56:08.596+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=144, gcTime=203, gcCount=1}
2020-02-09 07:56:58.786+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=217, gcTime=244, gcCount=1}
2020-02-09 07:57:36.050+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=112, gcTime=142, gcCount=1}
Then I checked my memory usage, and Neo4j is using over 50% of the machine's memory.
For now I think the slowdown may be caused by memory pressure. Could someone please help?
Update: I tested the writes against a new, empty database and the speed came back. So the question now is: once the database grows to a certain size (mine is 31 GB now), will that affect write performance? I only have 8 GB of RAM on this machine; is that too low?
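For reference, these are the neo4j.conf settings I understand control the heap and page cache in 3.5. The sizes below are only illustrative for an 8 GB machine, not my current configuration (neo4j-admin memrec can print recommendations for the actual machine):
'''
# neo4j.conf - illustrative sizes for an 8 GB machine, not my actual settings
dbms.memory.heap.initial_size=2g
dbms.memory.heap.max_size=2g
dbms.memory.pagecache.size=3g
'''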