I am trying to load a fairly large data set. My input file is straightforward but big, and every line in it is a Cypher command, such as:
MERGE (:typeImportOnDemandDeclaration {nodeSeq:4,name:'importjava.io.*;',compileunit:'webgoat.combined.source',type:'typeImportOnDemandDeclaration'});
Later in the same file are the node connections:
MATCH (a:ProgNode),(b:ProgNode) WITH a,b WHERE a.nodeSeq = 4 AND b.nodeSeq = 5 MERGE (a)-[r:Program{compileunit:'webgoat.combined.source', source:'webgoat.combined.source'}]->(b);
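To make the node-line format concrete, here is a minimal sketch of pulling the label and the property map out of one such line. The regular expression and the helper name `split_node_line` are hypothetical, and they only assume that node lines look exactly like the MERGE example above; anything else, including the MATCH lines, is simply not matched.

```python
import re

# Hypothetical pattern: matches node lines of the form  MERGE (:Label {props});
NODE_LINE = re.compile(
    r"^MERGE\s*\(\s*:(?P<label>\w+)\s*(?P<props>\{.*\})\s*\)\s*;?\s*$"
)


def split_node_line(line):
    """Return (label, props_text) for a node MERGE line, or None if it does not match."""
    m = NODE_LINE.match(line.strip())
    if m is None:
        return None
    return m.group("label"), m.group("props")


sample = (
    "MERGE (:typeImportOnDemandDeclaration {nodeSeq:4,name:'importjava.io.*;',"
    "compileunit:'webgoat.combined.source',type:'typeImportOnDemandDeclaration'});"
)
label, props = split_node_line(sample)
print(label)
print(props)
```

The property map is returned as raw text, not as a parsed dictionary, so it still has to be turned into real parameters before it could feed an UNWIND.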
All of these are located in a single file that comes in from multiple sources. When I wrote the original upload, it was fine with a few thousand nodes. But we just got a file with 100M lines and it's a bit slow. I realize I was not doing it efficiently, so I need to batch things up. That sounded easy. It has NOT been, and the answers scattered all over the internet are creating more confusion.
To start, I cannot go back and rewrite for CSV, for a variety of reasons. So unless someone can come up with a compelling reason to use CSV, that's out. It has to be some variant of the code below, where the line variable is a complete Cypher statement, as above. The "for line in FI:" loop iterates over the 100M Cypher lines. The label is not the same on each line; it varies.
This version uses a single embedded string (I know, clumsy), but none of my other variants had any better luck. The "payload" statement is the big one.
    batch_statement = """
    UNWIND {batch} as row
    MERGE (n:Label {row.id})
    (ON CREATE) SET n += row.properties
    """

    payload = '{batch: ['
    maxcount = 4
    with graphDB_Driver.session() as graphDB_Session:
        start_time = time.time()
        print("Starting Node load @ %s\n" % time.asctime())
        # Create nodes
        tx = graphDB_Session.begin_transaction()
        for line in FI:
            counter += 1
            if counter >= startrow:
                if (counter % maxcount) == 0:
                    print(counter)
                    payload = payload + payloadstring + "]" + batch_statement
                    # payload is the string I need to run.
                    tx.run(payload)
                    tx.commit()
                    print(" line %s was reached" % counter)
                    payload = '{batch: ['
                    time.sleep(3)
                firstword = line.split()[0]
                if firstword == "MATCH" and matchflag == False:
                    print(" Created %s nodes\n" % counter)
                    print(" Beginning links @ %s\n" % str(time.asctime()))
                    matchflag = True
                elif firstword == "CREATE" and createflag == False:
                    print(" Beginning Node Creation\n")
                    createflag = True
                elif firstword == "//" and postflag == False:
                    print(" %s @ %s\n" % (line[:-2], str(time.asctime())))
                    postflag = True
                else:
                    print(" %s @ %s - unknown \n" % (line[:-2], str(time.asctime())))
                if firstword != "//":
                    # break down the cypher into a key and a data
                    splitstart = line.find("{")
                    splitstop = line.find("}")
                    indexstring = "{id:'"+line[7:splitstart-1].strip()+"',"
                    payloadstring = indexstring + " properties:"+line[splitstart:splitstop]+"}"
                    payload = payload + payloadstring + ","
    FO.close()
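For comparison, the simplest batching I can picture is sketched below. It does not use UNWIND at all: it runs each Cypher line unchanged and only commits once per group of lines instead of once per line. The names `batches` and `load` and the default `size=1000` are hypothetical, and `session` is assumed to behave like the driver session above (only `begin_transaction()`, `tx.run()`, and `tx.commit()` are used).

```python
from itertools import islice


def batches(lines, size):
    """Yield lists of up to `size` statements, skipping blank and // comment lines."""
    statements = (ln.strip() for ln in lines)
    statements = (s for s in statements if s and not s.startswith("//"))
    while True:
        chunk = list(islice(statements, size))
        if not chunk:
            return
        yield chunk


def load(session, lines, size=1000):
    """Run every statement as-is, opening and committing one transaction per batch."""
    total = 0
    for chunk in batches(lines, size):
        tx = session.begin_transaction()
        for statement in chunk:
            tx.run(statement)
        tx.commit()
        total += len(chunk)
    return total
```

With the file above this would be called as something like `load(graphDB_Session, FI)`. Whether that is fast enough for 100M lines is untested; it only removes the per-line commit, not the per-line round trip.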
This seems basically easy to do, but it's beating me.
Thanks