I have a 15M x 10 dataframe in PySpark to load as nodes under a Neo4j label.
I'm using Neo4j server (5.23.0) on an Azure VM, and I connect to it from PySpark on Databricks.
I previously tried py2neo's create_nodes to create the 15M nodes in 10 batches; it completed in 15-20 minutes every time and the counts were correct.
I used the same approach for creating edges (py2neo.create_relationship) in 10 batches, but occasionally (randomly) some edges don't get created!
So I've looked into the neo4j Python driver (tx.run(query)) and apoc.merge.node, hoping they would be faster and more reliable, but given the details below the process is very slow (taking hours and not getting anywhere near completion) and the Neo4j DB often ends up shutting down. And this is just loading the nodes...
I've tried splitting into 10, 100 and 10,000 batches; in every case the process is very slow and the DB often shuts down.
I've tried both the driver's tx.run(query) with a simple MERGE query and apoc.merge.node; the data is batched up before being passed to the query.
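For reference, each batch is passed as a query parameter and unwound inside a single statement, roughly like the sketch below (the :Entity label and the name/number merge keys are stand-ins for my actual schema):

// $rows is one batch: a list of maps built from the dataframe rows
UNWIND $rows AS row
MERGE (n:Entity {name: row.name, number: row.number})
SET n += row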
Is there an index on the node label and the properties number and name?
For performance reasons, creating a schema index on the label or property is highly recommended when using MERGE. See Create, show, and delete indexes for more information.
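For example, assuming a stand-in :Entity label with name and number as the merge keys:

// composite index backing MERGE lookups on (name, number)
CREATE INDEX entity_name_number IF NOT EXISTS
FOR (n:Entity) ON (n.name, n.number);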
Thanks Dana!
I tried creating indices on name and number, and it did speed up the merge.
Just confirming my understanding:
MERGE will compare ALL properties of the new node with all existing nodes, and if it doesn't find a node with the exact same properties, it creates a new one.
If this is correct, does it then make sense to create indices on ALL the properties (rather than just name/number)?
Is there additional computation/search overhead when there are many indices vs. only a few?
> MERGE will compare ALL properties of the new node with all existing nodes, and if it doesn't find a node with the exact same properties, it creates a new one.
// upsert: create the order if ord_id is new, otherwise stamp the update date
MERGE (n:Order {ord_id: $orderID})
ON CREATE
  SET n.order_date = date(),
      n.order_vendor = $vendor,
      n.order_amt = $amt
ON MATCH
  SET n.update_date = date();
Here $orderID is effectively a primary key.
The above Cypher says: find a node with label :Order that has ord_id = $orderID, using an index if available, otherwise scanning all :Order nodes; if no such node is found, create it, otherwise update it.
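And since ord_id behaves like a primary key here, a uniqueness constraint may be worth considering instead of a plain index: it guarantees there can never be duplicate :Order nodes, and it also creates the backing index that MERGE can use (Neo4j 5 syntax):

// one :Order per ord_id; also provides the index used by the MERGE above
CREATE CONSTRAINT order_ord_id IF NOT EXISTS
FOR (n:Order) REQUIRE n.ord_id IS UNIQUE;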
I've looked at this MERGE documentation, which introduces a variety of MERGE options, some of which do not seem to include any 'match' conditions, such as:
MERGE (robert:Critic)
RETURN labels(robert)
What happens in a case like this? Would it create a new node named 'robert' regardless of whether or not a robert already exists in the data?
On a similar note, I'm keen to understand whether my code below actually checks for any key matches (similar to your example with $orderID), or whether it creates nodes regardless.
If something fails while running one of the batches (such as a Neo4j connection failure, a failed merge on a certain row, etc.), does the above code do any of the following:
retry the merge for the failed rows (batches)?
leave an error message in debug.log or another Neo4j log?
roll back, or something to that effect?
Using py2neo before, I had noticed the whole run completing successfully, but certain edges were not created.
Keen to know how this can be detected and remedied.
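The only check I've come up with so far is comparing counts after the load, along these lines (a sketch; :PLACED stands in for my actual relationship type):

// count the edges actually created, then compare with the source dataframe's row count
MATCH ()-[r:PLACED]->()
RETURN count(r) AS created_edges;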
> MERGE will compare ALL properties of the new node with all existing nodes, and if it doesn't find a node with the exact same properties, it creates a new one.
It means: find a node with label :Critic, aliased as robert for the duration of the query. If such a node exists, return all labels for it; if not, create the :Critic node.
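In other words, MERGE matches only on what is in the pattern itself, so to check for a key match the key property has to appear in the pattern. Contrast these two (the name property is just for illustration):

// matches every existing :Critic node, whatever its properties;
// creates a bare :Critic only if none exists
MERGE (robert:Critic)
RETURN labels(robert);

// matches only a :Critic with name = 'Robert'; otherwise creates one with that property
MERGE (robert:Critic {name: 'Robert'})
RETURN labels(robert);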