Creating 200K relationships to a node is taking a lot of time in Neo4J 3.5?

I have one vertex like this

Vertex1

{
    name:'hello',
    id: '2',
    key: '12345',
    col1: 'value1',
    col2: 'value2',
    .......
}

Vertex2, Vertex3, ..... Vertex200K

{
    name:'hello',
    id: '1',
    key: '12345',
    col1: 'value1',
    col2: 'value2',
    .......
}

Cypher Query

MATCH (a:Dense1) where a.id <> "1" 
WITH a 
MATCH (b:Dense1) where b.id = "1" 
WITH a,b 
WHERE a.key = b.key 
MERGE (a)-[:PARENT_OF]->(b)

The end result should be Vertex1 should have a degree of 200K, therefore, there should be 200K relationships. However, the above query takes a lot of time pretty much killing the throughput to 500/second. Any ideas on how to create relationships/edges quicker?

When I run the profile and the cypher query above it keeps running forever and doesn't return so I reduced the size from 200K to 20K and here is what the profile is showing me.

enter image description here

Here is the picture that shows DB hits.

enter image description here

I guess you should run the addition of 200k in smaller batches to have a more reasonable transaction size. Use apoc for this:

call apoc.periodic.iterate("
"match (b:Dense1{id:'1'}), (a:Dense1)
where a.id<>'1' AND a.key=b.key RETURN a,b",
"merge (a)-[:PARENT_OF]->(b)",
{batchSize:10000, parallel:false});

sure apoc.periodic.iterate is fine for 200K. but if I increase it to 1M with batches of 10K I can see that the time takes to complete each batch is monotonically increasing. and 1M isn't a lot either so it's hard for me to see how this would scale for 1Billion vertices which is my ultimate goal.

If I change Merge to CREATE it is a lot better and super quick. Is there any way to avoid MERGE/Double-check-locking? Because currently, it is killing the performance significantly. All I want is if a relationship exists between two nodes I don't want it to overwrite or do anything if that can help avoid locking. If there is an option in the Neo4J driver to say ignore or throw an error if a relationship already exists that would really help.

At the moment we don't have a way to ignore or throw an error if the relationship already exists.

That said, there may be another option here.

The current query uses a MERGE for each row, that's an expansion of the dense node(s) per row (with double-checked locking) and that's what's killing this, I believe.

So let's use a different approach, locking on the single node, expanding it ONCE, collecting all the nodes it currently has relationships to, and only then matching on the nodes we want to connect, collecting them, subtracting the elements of one list from the other, and on the remaining using a FOREACH to CREATE the relationships:

MATCH (b:Dense1) 
WHERE b.id = "1" 
CALL apoc.lock.nodes([b])
WITH b, [(b)<-[:PARENT_OF]-(parent) | parent] as parents
MATCH (a:Dense1)
WHERE a <> b AND a.key = b.key 
WITH b, apoc.coll.subtract(collect(a), parents) as toCreate
FOREACH (a in toCreate | 
  CREATE (a)-[:PARENT_OF]->(b))

Hi @andrew.bowman Thanks a lot for this. When can we expect this sort of functionality? I am assuming this should give me a performance similar to CREATE UNIQUE because when I run with CREATE, CREATE UNIQUE and MERGE this is total time taken for each case.

CREATE ~1.4 second (when run only once after inserting all the vertices but this does not work for me because I need to construct graph incrementally in batches but just out of curiosity I gave this approach a shot )
CREATE UNIQUE ~ 6 seconds
MERGE ~13 seconds.

However, I will run the above query and paste the results soon.

Another approach I was thinking of also doesnt exist today is be creating a uniqueness constraint on edge property. something like below.

For example, I create the graph incrementally (say in batches from Kafka by repeatedly applying the below operation for every batch):

CALL apoc.periodic.iterate("
MATCH (a:Dense1) 
WHERE a.id <> '1'
MATCH (b:Dense1) 
WHERE b.id = '1' AND a.key = b.key 
RETURN a, b",
"CREATE (a)-[:PARENT_OF {edge_id: a.uid + "," + b.uid}]->(b)",
{}) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages
CREATE CONSTRAINT ON ()-[r:PARENT_OF]-() ASSERT r.edge_id IS UNIQUE;

@andrew.bowman I ran your query and here are the results.

The very first run it took ~1.1 seconds similar to CREATE. The second run takes 126 seconds. so tells me this wont be efficient either to incrementally construct the graph in batches.

Here is the query plan

Did you check how long your first, driving query runs?

it looks quite expensive. I guess you'd want to index "key" and use that to drive the lookup and not the a.id<>'1'

I don't think the locks are the problem here. You said this took 1.1 seconds to CREATE, and that path is the one that uses the most locks (the one b node and every single of the nodes to connect to it).

In the second run the only lock taken is on the b node, all the other nodes get filtered out and aren't present when we get to the CREATE, so no locks are taken on any of them.

The bottleneck looks to be the comparison and subtraction operations.

(just to clarify only using CREATE still uses locks, that's required in order to add the relationships in an ACIDic way, so since you previously had good speed with just CREATE, the locks are not the bottleneck on this one).
I'm going to play a bit more with this.

Okay, I got this.

The problem with my last query was with the subtraction operation. We'll need to look into what's going on the code to see if we can improve that.

The good news is, we don't need to use it in this case, we can just default to a list membership check. I had thought this would be less efficient, but turns out it's pretty fast.

Here you go, give it a try:

MATCH (b:Dense1) 
WHERE b.id = "1" 
CALL apoc.lock.nodes([b])
WITH b, [(b)<-[:PARENT_OF]-(parent) | parent] as parents
MATCH (a:Dense1)
WHERE a.key = b.key AND NOT a in parents AND a <> b
CREATE (a)-[:PARENT_OF]->(b)

Okay, so with a bit more tinkering, we can actually use a simpler query where we don't have to deal with collections at all, it should be comparably quick:

MATCH (b:Dense1) 
WHERE b.id = "1" 
CALL apoc.lock.nodes([b])
MATCH (a:Dense1)
WHERE a.key = b.key AND a <> b AND NOT (a)-[:PARENT_OF]->(b) 
CREATE (a)-[:PARENT_OF]->(b)

EDIT:

So this seems to be quick in some cases, but not all. This seems to hang for me on the first run, when no relationships are present.

I think the bottleneck in this case is that when determining if the nodes are connected, it's expanding from b, which is cheap at the start, but gets increasingly expensive as more nodes get connected.

We can remedy this somewhat by using an APOC function which uses a connected check which is optimized for dense nodes. In this case, it figures out during runtime that it's cheaper to check for a connection from the a nodes instead of the now-dense b node.

MATCH (b:Dense1) 
WHERE b.id = "1" 
CALL apoc.lock.nodes([b])
MATCH (a:Dense1)
WHERE a.key = b.key AND a <> b AND NOT apoc.nodes.connected(a, b, 'PARENT_OF>')
CREATE (a)-[:PARENT_OF]->(b)

Now that I look at this, I think we've identified the root of the slowdown, including within the original query with MERGE.

The problem isn't with locks, or double-checked locking (okay it's KIND of related, but not the heart of this).

The problem is the direction the planner is choosing to test whether the nodes are connected (which is part of the MERGE operation...twice, because of the double-checked locking), and how it isn't coping with the changes to the degree of the relationship as the relationships are added and the node becomes dense.

Again, at the start, b has degree 0, so the planner uses an expand from b to a to see if the nodes are connected. And that's just fine until enough relationships are added that b has become dense, and the cost for this check becomes increasingly expensive. The more dense it gets, the worse it performs for the next row.

So there are a few ways to solve this.

In my first (viable and working) solution with collections and the list membership check (not using subtract()) we avoided the expansion problem completely. This is probably a great general solution that should work in all cases.

In the second efficient solution (using apoc.nodes.connected()) the function allowed the query to choose at runtime, on a row-by-row basis, which node to expand from to test for the check, so it avoids the cost of expanding from the now-dense node b. This should be a good solution as long as both nodes are not dense.

And just to prove what's going on...if you create your nodes, but add some relationships on b at the very start, and use the query with MERGE from the start (with a slight tweak), you should see it should only take a couple seconds, because the planner seems to choose to expand from a nodes rather than the b node, avoiding being bitten as b becomes dense:

unwind range(1,200000) as id
create (:Dense1 {id:toString(id), key:12345})

now seed some relationships on b

match (b:Dense1{id:"1"})
unwind range(1,100) as id
create (:Dummy)-[:PARENT_OF]->(b)

and use the MERGE query

profile
MATCH (b:Dense1) where b.id = "1" 
MATCH (a:Dense1) WHERE a.key = b.key and  a.id <> "1"
MERGE (a)-[:PARENT_OF]->(b)

I'll see what our Cypher engineers can use from this.

@michael.hunger Tried creating an index as you suggest doesn't seem to improve at all.

@andrew.bowman Thanks a lot for this! That a good explanation. I tried as you suggested and I agree with your analysis. That said I am surprised why this is not taken care of by the planner? Why not use the stats table like RDBMS and construct an optimized plan?

Users just want to write a query and the query engine should restructure it if there is a better way (the whole point of declarative programming) or at very least expose an interface so a user can write rules in cases like this such that the query engine can utilize.

Also apoc is fine for now but it feels more like it is glued from somewhere instead of native support from the engine.

I hope neo4J supports more of these things natively.

While the planner is taking care of it (I believe it's the planner that decided the direction to use at the start), the fact of the matter is that I think this decision shouldn't happen in the planner itself, which is finalized before the query starts to execute, but during runtime, so it can cope with changes introduced as the query executes. Note that while it was obvious in this particular case that we would be making b into a dense node, and the planner could see this and react accordingly, we can't say the same for all queries, it may be data driven and the planner would not pick up on this. So I don't think the planner is the right place for this decision.

I would consider this current behavior a bug. We're putting this on the Cypher backlog and we'll see how we want to deal with it.

@andrew.bowman Just out of Curiosity why is locking needed to check if a relationship or node exists? For updates or deletions, I can understand but for reads, I am not seeing why the lock is necessary?

"Note that while it was obvious in this particular case that we would be making b into a dense node, and the planner could see this and react accordingly, we can't say the same for all queries"

I don't think its about one query or all queries. This is indeed a basic thing and I am not sure why planner cannot get this basic information and adapt.

This is in relation to a MERGE operation. For a relationship, a MERGE is like a MATCH, and if the pattern doesn't exist, then a CREATE of the relationship.

The first MATCH from the MERGE is done without locks, since if the relationship exists it will match on what's there and all is good, nothing needs to be created, no locks need to be taken.

But if the MATCH fails, and the relationship doesn't exist, then locks are taken, and then the double-check for that MATCH is run (since there is the possibility that a separate transaction may have created the relationship between the time of your check and the time that you took the locks), and then if the relationship still needs to be created, the CREATE part is executed.

In one of the queries I provided, I took locks before the read. That was because I wasn't using MERGE, I was using CREATE, but I still needed to guard against the race condition, so I preemptively took the lock on the node, MATCHed out, filtered, and then could perform the CREATE knowing that because I took the locks early that the results of the earlier MATCH could no longer be changed by a concurrent transaction.

Regarding my comment about the planner, please try using EXPLAIN or PROFILE on the query. You'll note that the operations planned are for the entire query. That means that the direction of the expand operation would have to be planned before the query executed, and would not change during execution. The planner's choice wouldn't be flexible during execution. That means the choice of how to do this check shouldn't be decided by the planner (which doesn't have the flexibility to change itself during execution), but at runtime as a part of how MERGE and MATCH execute. Keep in mind, once the query is planned, and once the query starts executing, the planner is OUT of the picture.

The "basic information" the planner has to use does not include the parameters passed in, since the planning happens before the query executes, not during, and having the planning depend upon the data would put a very odd dependency in place, and would likely mean the planning step would have to pre-process the data somehow, so the amount of data would influence the planning (also meaning that the more data is passed to a query, the longer the planning would take. It would also invalidate the ability to have query plan caching, since the data would cause us to toss previous plans for the query every time). That's contrary to what the planner is for. The query plan should not have to depend upon the parameters passed in.

Again, the planner is the wrong place for these decisions to happen. They must happen at runtime to be efficient to cope to changes that happen during runtime.

The problem sounds akin to RDMS planner trying to choose between SortMergeJOIN and BroadCast Join. As an example here is how spark does adaptive scheduling.

Also what kind of locks does it take is it similar to synchronized block or spin lock? From what I looked into the code it looks like synchronized which takes a good amount CPU cycles. Is double-checked locking unavoidable? Isn't there any other way? Why not use lock free or share nothing architecture?

Thanks a lot for all your responses!

I can't currently answer about the planning here or the lock types, I'll see if our engineers have comment.

For double-checked locking, for correctness it is unavoidable (otherwise we risk adding duplicates) but there are some things we can do to improve this, such as detecting when we already hold at least one of the locks involved and decide that a double-check is unnecessary. That seems like low hanging fruit for improvement.

As far as your suggestions, if I'm hearing right you're talking about a distributed/sharded deployment. This is under evaluation, but keep in mind that your suggestion may not be a good fit for an ACID database, especially as we have transactional metrics that are collected. You may see some exploration into this in the near future, but I'm not sure about lock architecture.