Creating relationships over several million nodes

Hello!

I have 2 node labels: one with around 750K nodes, and another with almost 50 million nodes. I have a stand-alone Neo4j 3.5 instance running with 8 cores and 96 GB of memory. I am creating relationships between these node labels using apoc.periodic.iterate, but the query has been running for too long.
Some of the node properties get refreshed every day, and I want to be able to refresh the relationships once the node properties are updated under both labels. The last time I ran the query to update the relationships, it ran for more than 8 hours and I had to kill it.
Do you guys have any suggestions to speed this process up? Should I be doing this in a different way? If so, would you please let me know the approach? (I'm new to Neo4j.)

Thanks!!

call apoc.periodic.iterate("match (i:Inventory) , (p:Products) where p.p3 <> '0' and i.p1= p.p1 OR where p.p3 <> '0' and i.p2= p.p2 return i,p","merge (i) -[:PRODUCES]->(p)",{batchSize:20000, parallel:false})

This is how the query looks, and I have indexes on p1, p2 and p3.

  1. Negation is expensive (good formatting isn't) :stuck_out_tongue_winking_eye:

First, let's get this one running faster:

MATCH (i:Inventory), (p:Products)
WHERE p.p3 <> '0' AND i.p1 = p.p1 OR // this one is gonna *kill* your CPU and RAM
WHERE p.p3 <> '0' AND i.p2 = p.p2
RETURN i, p

So... let's make sure we only do that negation once, and group the rest of the WHERE clause more carefully:

MATCH (p:Products)
WHERE p.p3 <> '0'

WITH p
MATCH (i:Inventory)
WHERE i.p1 = p.p1 OR i.p2 = p.p2
RETURN i, p

...which will get you to a faster solution:

CALL apoc.periodic.iterate("
    MATCH (p:Products) WHERE p.p3 <> '0' WITH p
    MATCH (i:Inventory) WHERE i.p1 = p.p1 OR i.p2 = p.p2
    RETURN i, p
","
    MERGE (i)-[:PRODUCES]->(p)
", {batchSize: 20000, parallel: false})

However, these would probably be better handled when importing the data, or the changes. When you're updating the inventory, the additional Cypher to adjust these relationships wouldn't have to check the entire DB. I strongly advise taking a very close look at whatever is creating/updating this data, as it will be orders of magnitude more efficient to do it then, instead of after the fact.


Thank you @tony.chiboucas!! I'll try your query and see if it helps.

I'm sorry about the formatting. :)

I absolutely agree with you. I should be creating this relationship when the Product nodes are created/updated batch-wise. I am using the golang-neo4j driver and session.WriteTransaction to write/update 20000 Product nodes at a time using a goroutine. I will just plug this Cypher query into that goroutine function.
Since I am still in the testing phase, I wanted to give this query a try by running it manually in the browser.

Thanks again!

Could you share the query/command that is creating/updating the nodes to begin with?

Sure.. here it is. I pasted only the data-create portions:


func InsertToNeo(data interface{}) error {
	m := ArrayStructToMap(data)
	session, err := NewSession(InfraObj.NeoDriver)
	if err != nil {
		return err
	}
	defer session.Close()

	_, err = session.WriteTransaction(execute(m), neo4j.WithTxTimeout(300*time.Second))
	if err != nil {
		return err
	}

	return nil
}

func execute(data interface{}) func(tx neo4j.Transaction) (interface{}, error) {
	return func(tx neo4j.Transaction) (interface{}, error) {
		result, err := tx.Run(
			`UNWIND $events AS event
			 MERGE (a:Products {key: event.key})
			 SET
			     a.p1 = event.p1,
			     a.p2 = event.p2,
			     a.key = event.key,
			     .
			     .
			     .
			     a.p10 = event.p10`,
			map[string]interface{}{"events": data})
		if err != nil {
			return nil, err
		}
		return nil, result.Err()
	}
}

I am still not sure how to plug these events (data) into the relationship-creation query. If you know how to do it, please let me know. That would save a ton of my time :slight_smile: Thanks in advance!

UNWIND $events AS event
MERGE (a:Products {key: event.key})
SET
	a.p1 = event.p1,
	a.p2 = event.p2,
	a.key = event.key,
	...
// Just add at the end...

WITH a
MATCH (i:Inventory)
WHERE i.p1 = a.p1 OR i.p2 = a.p2
MERGE (i)-[:PRODUCES]->(a)

Note: this won't clean up old connections. As your data changes, old i - p relationships will persist. You'll want to come up with a strategy for when/how to clear old :PRODUCES rels.
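For example, one possible cleanup pass (just a sketch using the p1/p2 matching rule from this thread; adjust the WHERE clause to whatever your real matching rules are) would delete any :PRODUCES relationship whose endpoints no longer match on either property, batched with apoc.periodic.iterate as elsewhere in this thread:

```
// Sketch: remove stale rels where neither join property still matches.
CALL apoc.periodic.iterate("
    MATCH (i:Inventory)-[r:PRODUCES]->(p:Products)
    WHERE i.p1 <> p.p1 AND i.p2 <> p.p2
    RETURN r
","
    DELETE r
", {batchSize: 20000, parallel: false})
```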


WOW, I honestly didn't think this was going to be this easy :slight_smile: Thanks so much!!!!

Yes, I am going to wipe the DB and reload the data from scratch.

Ohh.. a quick question: do you think I still need to plug the below query into my Products update function, or is it enough to run it in the create function like above?

MATCH (i :Inventory)
WHERE i.p1 = a.p1 OR i.p2 = a.p2
MERGE (i)-[:PRODUCES]->(a)

That's the point of doing it during CREATE: the relationships are created then. No extra work needed. Everything better, faster, stronger.

gotcha.. thanks a ton, Sir!
Have a great day!

I'm sorry to bother you again. :frowning:

It seems that it's still taking too long to create the relationships. Below is my code snippet:

result, err := tx.Run(
		`unwind $events as event MERGE (a:Products {key: event.key}) SET
			a.p1 = event.p1,
			a.p2 = event.p2,
			a.key = event.key,
			.
            .
		    .	
            a.pN = event.pN

			WITH a MATCH i:Inventory) WHERE i.p1 = a.p1 OR i.p2 = a.p2  
			MERGE (i) -[:PRODUCES]->(a)`,
			map[string]interface{}{"events": data})

For testing purposes, I am only creating 25 a records at a time and trying to create PRODUCES rels against 750K i records. It's taking forever... (I just killed the program.)
If you have any insight on this please let me know. Thank you!

  1. Did you create indices on i.p1 and i.p2?

CREATE INDEX ON :Inventory(p1)
CREATE INDEX ON :Inventory(p2)

  2. You're missing a parenthesis: WITH a MATCH i:Inventory)

  3. Isolate code steps, and get the profile of that Cypher.

result, err := tx.Run(
    `UNWIND $events AS event CREATE (a:TempEvent) SET a = event`,
    map[string]interface{}{"events": data})

Once that has run, open a Neo4j Browser and run the next step there with EXPLAIN (or PROFILE, which actually executes the query and reports db hits). That'll show you where the bottleneck is.

EXPLAIN 
MATCH (a:TempEvent)
MATCH (i:Inventory) WHERE i.p1 = a.p1 OR i.p2 = a.p2  
MERGE (i) -[:PRODUCES]->(a)

Note: simplify

If you really are trying to copy all $event properties onto a :Products node, and an $event never contains an object in a property, you can simply do the following:

result, err := tx.Run(
    `UNWIND $events AS event
     MERGE (a:Products {key: event.key}) SET a = event`,
    map[string]interface{}{"events": data},
)

If an event does not meet that criterion, I'd give that class a method which returns a structure that does.
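As an illustration of that idea, here's a minimal Go sketch. The `Event` struct, its fields, and the `ToNeoProps` method are hypothetical (not from this thread); the point is just a method that returns a flat, primitive-only map that is safe to pass as the $event used in `SET a = event`:

```go
package main

import "fmt"

// Event is a hypothetical stand-in for this thread's event data.
type Event struct {
	Key  string
	P1   string
	P2   string
	Meta map[string]string // nested data that can't be stored directly as a node property
}

// ToNeoProps returns a flat map containing only primitive values,
// suitable for `SET a = event` in Cypher.
func (e Event) ToNeoProps() map[string]interface{} {
	props := map[string]interface{}{
		"key": e.Key,
		"p1":  e.P1,
		"p2":  e.P2,
	}
	// Flatten the nested map into prefixed top-level keys.
	for k, v := range e.Meta {
		props["meta_"+k] = v
	}
	return props
}

func main() {
	e := Event{Key: "k1", P1: "a", P2: "b", Meta: map[string]string{"src": "feed"}}
	fmt.Println(e.ToNeoProps())
}
```

You'd then build the `$events` parameter from `ToNeoProps()` results instead of the raw structs.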

Thank you for your time.

  1. It was created before running the program.
  2. Sorry, that was a copy/paste error here.
  3. Correct: an $event does not contain an object, and your UNWIND worked fine. I'll follow this for creating new nodes.

Here is the EXPLAIN output. I didn't get much out of it. I am reading about EXPLAIN now, and meanwhile I thought I would share the screenshot here. Thanks!

...I just tried the below query in the browser without the OR condition; it finished within 1 ms.

MATCH (a:TempEvent) 
MATCH (i:Inventory) WHERE i.p1 = a.p2 
MERGE (i) -[:PRODUCES]->(a)

That profile looks good. It is using the index. These should run pretty quickly, even with such a large dataset.

...and how did it do when you kept the OR condition?

I had to kill it after 30 minutes. I was tracking CPU usage while it was running; I did not see it go beyond 13%.

Okay, then for whatever reason it's trying to load all of :Inventory before applying the OR to exclude nodes. It shouldn't be doing that, but that'll be a tough one to debug.

We can work around the problem by never using an OR, and instead calling the MATCH ... MERGE once for each property. The MERGE prevents creating duplicate rels, so there will only ever be one or zero [:PRODUCES] for each i : p pair.

result, err := tx.Run(`
        UNWIND $events AS event
        MERGE (a:Products {key: event.key})
        SET a = event

        WITH a MATCH (i:Inventory) WHERE i.p1 = a.p1 MERGE (i)-[:PRODUCES]-(a)
        WITH a MATCH (i:Inventory) WHERE i.p2 = a.p2 MERGE (i)-[:PRODUCES]-(a)
    `,
    map[string]interface{}{"events": data},
)

...sorry, there's not really a cleaner way to do that without writing your own Neo4j procedure.

Now at least we know why your process was taking forever: it was trying to load your entire :Inventory into RAM.
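As an aside, another workaround sometimes used for OR across two indexed properties (a sketch, not something tried in this thread) is UNION, which lets each branch use its own index seek instead of falling back to a label scan:

```
// Sketch: each UNION branch can seek on its own indexed property.
MATCH (a:TempEvent), (i:Inventory) WHERE i.p1 = a.p1
RETURN i, a
UNION
MATCH (a:TempEvent), (i:Inventory) WHERE i.p2 = a.p2
RETURN i, a
```

UNION (as opposed to UNION ALL) also deduplicates rows where both properties match.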

Ok .. that makes sense. Thanks!!
(My earlier statement was incorrect so deleted it)

result, err := tx.Run(`
        UNWIND $events AS event
        MERGE (a:Products {key: event.key})
        SET a = event

        WITH a MATCH (i:Inventory) WHERE i.p1 = a.p1 MERGE (i)-[:PRODUCES]-(a)
        WITH a MATCH (i:Inventory) WHERE i.p2 = a.p2 MERGE (i)-[:PRODUCES]-(a)
    `,
    map[string]interface{}{"events": data},
)

I tried this, but it was still hanging and my program wouldn't complete. I killed it, interchanged the positions of the two queries in my code, and voila, it finished within a few ms.

result, err := tx.Run(`
        UNWIND $events AS event
        MERGE (a:Products {key: event.key})
        SET a = event

        WITH a MATCH (i:Inventory) WHERE i.p2 = a.p2 MERGE (i)-[:PRODUCES]-(a)
        WITH a MATCH (i:Inventory) WHERE i.p1 = a.p1 MERGE (i)-[:PRODUCES]-(a)
    `,
    map[string]interface{}{"events": data},
)

I am going to test it against 5000 nodes now and see how it behaves.

Thanks!