Using neo4j module and/or apoc to merge large number of nodes

Hi,

I have a 15M x 10 DataFrame in PySpark to load as nodes under a Neo4j label.
I'm using Neo4j server (5.23.0) on an Azure VM, and connecting to it from PySpark on Databricks.

Previously I tried py2neo's create_nodes to create the 15M nodes in 10 batches; it completed in 15-20 minutes every time and the counts were correct.
I used the same approach for creating edges (py2neo's create_relationship) in 10 batches, but occasionally (and seemingly randomly) some edges don't get created!

So I've looked into the neo4j module (tx.run(query)) and apoc.merge.node, hoping they would be faster and more reliable. But given the details below, the process is very slow (taking hours without getting anywhere near completion), and the Neo4j DB often ends up shutting down. And this is just loading the nodes...

  • tried with 10, 100 and 10,000 batches; the process is very slow and the Neo4j DB often ends up shutting down.
  • tried both the neo4j driver's tx.run(query) with a simple MERGE query AND apoc.merge.node.... the data is batched up before being passed to the query.
  • the node load method looks like this:

def apoc_load_org_nodes(tx, label, data):
    query = f"""
    UNWIND $batch AS row
    CALL apoc.merge.node(
        ["{label}"],
        {{number: row.ID,
          name: row.NAME,
          ...
        }})
    YIELD node
    RETURN node
    """
    tx.run(query, batch=data)
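
For reference, the batching around this call looks roughly like the sketch below (the chunking helper, batch size, and connection details are illustrative, not my actual code):

```python
from itertools import islice

def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Illustrative usage with the official neo4j driver (pip install neo4j):
#
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver("bolt://<azure-vm>:7687", auth=("neo4j", "<password>"))
# rows = df.toPandas().to_dict("records")  # collect the Spark DataFrame to dicts
# with driver.session() as session:
#     for batch in chunked(rows, 10_000):
#         session.execute_write(apoc_load_org_nodes, "Org", batch)
```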

@ava.bargi

Is there an index for that node label on the properties number and name?


For performance reasons, creating a schema index on the label or property is highly recommended when using MERGE. See Create, show, and delete indexes for more information.
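
For example, assuming the label from your post is Org and you merge on number and name (adjust the names to your schema), the indexes would be:

```cypher
CREATE INDEX org_number IF NOT EXISTS FOR (n:Org) ON (n.number);
CREATE INDEX org_name   IF NOT EXISTS FOR (n:Org) ON (n.name);
```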

Thanks Dana!
I tried creating indexes on name and number, and it did speed up the merge.

Just confirming my understanding:

  • the merge will compare ALL properties of the new node with all existing nodes and if it doesn't find a node with exact same properties, creates a new one.

If this is correct, then does it make sense to create indexes on ALL the properties (rather than just name/number)?

Is there additional computation/search overhead when we have many indexes, versus only a few?

@ava.bargi

  • the merge will compare ALL properties of the new node with all existing nodes and if it doesn't find a node with exact same properties, creates a new one.

Is that from a page in the Neo4j documentation?

typically a merge might be

merge (n:Order {ord_id: $orderID})
  on create
    set n.order_date = date(),
        n.order_vendor = $vendor,
        n.order_amt = $amt
  on match
    set n.update_date = date()
;

Here $orderID acts much like a primary key.
The Cypher above effectively says: find a node with label :Order that has ord_id = $orderID (using an index if available, otherwise scanning all :Order nodes); if none is found, create it, otherwise update it.

I've looked at the MERGE documentation, which introduces a variety of merge options, some of which do not seem to include any 'match' conditions, such as:

MERGE (robert:Critic)
RETURN labels(robert)

what happens in a case like this? would it create a new node with name 'robert' regardless of whether or not a robert exists in the data?

On a similar note, I'm keen to understand whether my code below actually checks for key matches (similar to your example with orderID), or whether it creates nodes regardless.

def apoc_load_org_nodes(tx, label, data):
    query = f"""
    UNWIND $batch AS row
    CALL apoc.merge.node(
        ["{label}"],
        {{number: row.ID,
          name: row.NAME,
          ...
        }})
    YIELD node
    RETURN node
    """
    tx.run(query, batch=data)
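
From what I can tell from the apoc.merge.node docs, the second argument is the map of identifying properties used for the match, and an optional third map holds on-create properties. So passing every property in the second map, as above, would mean the merge only matches a node where every listed property is identical. A sketch of splitting the two (the choice of number as the key is illustrative):

```cypher
UNWIND $batch AS row
CALL apoc.merge.node(
  ['Org'],            // labels
  {number: row.ID},   // identifying properties: the merge key
  {name: row.NAME}    // onCreateProps: set only when the node is created
)
YIELD node
RETURN count(node);
```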

Another question:

If something fails while running one of the batches (e.g. a Neo4j connection failure, a merge failure on certain rows, etc.), does the above code do any of the following:

  • retry the merge for the failed rows (or batches)
  • leave an error message in debug.log or other Neo4j logs
  • roll back, or something to that effect?

Using py2neo before, I had noticed the whole run completing successfully even though certain edges were not created.

Keen to know how this can be detected and remedied.

Thanks!

@ava.bargi

  • the merge will compare ALL properties of the new node with all existing nodes and if it doesn't find a node with exact same properties, creates a new one.

Is that from a page in the Neo4j documentation?

MERGE (robert:Critic)
RETURN labels(robert)

Find me a node with label :Critic and alias it as robert for the duration of the query. If it exists, return all labels for that node; if not, create the :Critic node.
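
In other words, repeating the pattern is idempotent. Assuming no :Critic node exists beforehand:

```cypher
MERGE (robert:Critic);             // no :Critic exists yet, so one is created
MERGE (robert2:Critic);            // matches the existing :Critic node; nothing new is created
MATCH (c:Critic) RETURN count(c);  // returns 1
```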

Using py2neo ....

It should be noted that py2neo is not an official Neo4j Python driver, unlike the officially supported drivers listed at the Neo4j Deployment Center.
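
A minimal sketch of the same batched merge with the official driver (the URI, credentials, and the choice of number as the merge key are illustrative):

```python
def merge_query(label):
    """Build a batched MERGE statement for the given label (sketch)."""
    return (
        "UNWIND $batch AS row "
        f"MERGE (n:`{label}` {{number: row.ID}}) "
        "SET n.name = row.NAME"
    )

def load_with_official_driver(uri, auth, label, batches):
    """Write each batch in a managed transaction.

    execute_write automatically retries the work function on transient
    errors (brief connection drops, cluster leader switches), which is
    one way to guard against silently lost writes.
    """
    from neo4j import GraphDatabase  # official driver: pip install neo4j
    with GraphDatabase.driver(uri, auth=auth) as driver:
        with driver.session() as session:
            for batch in batches:
                session.execute_write(
                    lambda tx, b=batch: tx.run(merge_query(label), batch=b).consume()
                )
```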