Creating single relationship instead of two in order to drive efficiency

Good day all,

Being new to graph and neo4j I am having a bit of trouble in my execution.

Setup: I have a data model of (:Client)-[:HAS_A_CHARACTERISTIC]->(:Characteristic). There are many characteristics (~230) and I am using these characteristics to create a direct relation between clients (i.e. (:Client)-[:MUTUAL]-(:Client)). The [:MUTUAL] relation will have a strength of [common_characteristics / max(client1_chars,client2_chars)].
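
To make the strength definition concrete, here is a minimal sketch in plain Python (not Cypher, so it runs without a database). The function name and the toy characteristic sets are hypothetical; only the formula comes from the setup above.

```python
def mutual_strength(chars_a: set, chars_b: set) -> float:
    """common_characteristics / max(client1_chars, client2_chars)."""
    denom = max(len(chars_a), len(chars_b))
    if denom == 0:
        # Assumption: two clients without characteristics get strength 0.
        return 0.0
    return len(chars_a & chars_b) / denom


# Hypothetical clients: 2 shared characteristics, degrees 4 and 3.
client1 = {"c1", "c2", "c3", "c4"}
client2 = {"c2", "c3", "c5"}

print(mutual_strength(client1, client2))  # 2 / max(4, 3) = 0.5
```

The value is the same whichever client is passed first, which is why one undirected [:MUTUAL] relationship per pair is enough.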

I am able to calculate these values and even create these relationships, however it goes without saying that these are actually bi-directional relations, and therefore two relationships are being created. This is not a problem with a very small subset, however I have ~2.2mil client nodes.

How do I create a single relationship?

Below is the query that I have used:

match (i:`Cape Town`)-[]->(b)<-[]-(j:`Cape Town`)
with i, j, count(b) as comm_bh
unwind [i.bh_degree,j.bh_degree] as bh_degs
with i,j,comm_bh,max(bh_degs) as deg_denom
with i,j,comm_bh,deg_denom, case when exists((j)--(i)) then true else false end as created_ind
foreach (n in case when created_ind then [] else [1] end |
    create (i)-[m:MUTUAL]->(j)    // merge also creates duplicates...
    set m.strength=(comm_bh*1.0/deg_denom))
return i as start_node, comm_bh, deg_denom,created_ind, j as end_node

Hi @joestry ,

Can you try something like this?

match (i:`Cape Town`)-[]->(b)<-[]-(j:`Cape Town`)
with i, j, count(b) as comm_bh
where id(i) < id(j)
unwind [i.bh_degree,j.bh_degree] as bh_degs
with i,j,comm_bh,max(bh_degs) as deg_denom
with i,j,comm_bh,deg_denom, case when exists((j)--(i)) then true else false end as created_ind
foreach (n in case when created_ind then [] else [1] end |
    create (i)-[m:MUTUAL]->(j)    // merge also creates duplicates...
    set m.strength=(comm_bh*1.0/deg_denom))
return i as start_node, comm_bh, deg_denom,created_ind, j as end_node

Lemme know how it goes.

B

Hi @bennu_neo -- Just on the previous reply... my CPU was at 100% utilization when I ran the apoc.periodic.iterate(); now it seems to be sitting at around 30-50% utilization, and the process is still running...

The memory is still running at 100% of the allotment.

The data query runs on the trimmed graph (which I do not really want to do), and my operation query is simply 'return 1' -- this may have something to do with the lower CPU utilization...

Hi @Bennu

Thank you for the above. I only saw it now and have been trying a lot of different things today, which have been causing heap memory outages -- which I was anticipating.

I will try your code shortly and give you feedback, there is just an APOC function running now which I do not want to stop... I did that previously and needed to reinstall the entire suite (...still learning my way around neo4j).

What I have tried in order to prevent the memory outages is the following:

  • Increased my heap memory config (memrec values in parentheses) -- still getting error
    • dbms.memory.heap.initial_size=8G (5100m)
    • dbms.memory.heap.max_size=8G (5100m)
    • dbms.memory.pagecache.size=8G (7000m)
  • Used apoc.periodic.iterate() (please see query below)-- still getting error
  • Now trimming my nodes (for one subgraph went down from ~400k to ~220k nodes) -- this is still running

The apoc query (which is still causing memory outages on the ~400k nodes):

call apoc.periodic.iterate(
    "match (i:Durban)-[]->(b)<-[]-(j:Durban)
    where not b:City and not b:Suburb
    return i,j,b
    ",
    "
    with i, j, count(b) as comm_bh
    unwind [i.bh_degree,j.bh_degree] as bh_degs
    with i,j,comm_bh,max(bh_degs) as deg_denom
    where (comm_bh*1.0/deg_denom)>=0.1
    foreach (n in case when (comm_bh*1.0/deg_denom)>=0.2 then [1] else [] end |
        create (i)-[m:MUTUAL]->(j)
        set m.common_bh=comm_bh,m.deg_denominator=deg_denom,m.strength=(comm_bh*1.0/deg_denom))",
    {batchSize:5000, parallel:false}
    )

Hi @bennu_neo

Thank you for the query above. I did run it and the relations did halve as expected.

However, m.strength and m.deg_denom are not being populated any more. This was unexpected, but does make sense to a degree, as it seems as though the node with the higher id is being excluded from the remainder of the execution plan...?

This does pose a problem as m.strength is really the property that I am after. It forms the basis of the analysis going further.

Any ideas around how to correct this, would be highly appreciated.

Also, your feedback on my implementation of the apoc.periodic.iterate() query below will be highly appreciated.

Hi @glilienfield and @bennu_neo ,

Thank you guys a lot for the assistance that you have given to this point. I really do appreciate it.

I edited this apoc-procedure in the following way, where I split the process up into two parts:

  1. Creating/merging the relationships with the common_bh as its only property
  2. Querying the relationships to add the additional properties

The first part (and it works on the sample set):

call apoc.periodic.iterate(
    "match (i:Durban)-[]->(b)<-[]-(j:Durban)
    where not b:City and not b:Suburb and id(i) < id(j)
    return i, j, count(b) as comm_bh",
    "merge (i)-[m:MUTUAL]->(j)    // merge also creates duplicates... and it is slower
    set m.common_bh=comm_bh",
    {batchSize:10000,parallel:false}
    )

The second part where I am setting the additional properties only for the relations:

call apoc.periodic.iterate(
    "match ()-[r:MUTUAL]->()
    return r",
    "match (i)-[r]->(j)
    unwind [i.bh_degree,j.bh_degree] as bh_degs
    with i,j,r, max(bh_degs) as deg_denom
    set r.deg_denominator=deg_denom,r.strength=(r.common_bh*1.0/deg_denom)
    ",
    {batchSize:10000, parallel:false}
)

I am still running into heap error on the smaller actual dataset on the first query.

Anything that I am missing?

@bennu_neo's recommendation to filter by 'id(i)<id(j)' is correct to remove the duplicate paths when you have a symmetrical match pattern. Not sure what is going on with your result, but the following refactored version works:

match (i:`Cape Town`)-[]->(b)<-[]-(j:`Cape Town`)
where id(i) < id(j)
with i, j, count(b) as comm_bh, exists((j)--(i)) as created_ind,
case when i.bh_degree > j.bh_degree then i.bh_degree else j.bh_degree end as deg_denom    // larger of the two degrees, as in the original max(...)
call{
    with i, j, comm_bh, deg_denom, created_ind
    with i, j, comm_bh, deg_denom, created_ind    // second 'with' is needed: 'where' cannot follow the importing 'with'
    where not created_ind
    create (i)-[m:MUTUAL]->(j) 
    set m.strength=(comm_bh*1.0/deg_denom)
}
return i as start_node, comm_bh, deg_denom, created_ind, j as end_node

Question though, wouldn't you want to update the MUTUAL relationship each time, so that you capture any changes in the values of each bh_degree property and the count of common Characteristic nodes? You will not get multiple MUTUAL relationships if you use a 'merge' instead of a 'create'.
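
As a rough illustration of why the 'id(i) < id(j)' filter halves the result, here is a small plain-Python sketch (not Cypher). The client ids and characteristic sets are made-up toy data, and the integer keys stand in for Neo4j node ids.

```python
from itertools import permutations

# Hypothetical toy data: client id -> set of characteristic names.
clients = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"a", "c"}}

# A symmetric pattern matches every connected pair in both orders:
# (i, j) and (j, i) are separate rows.
ordered = [(i, j) for i, j in permutations(clients, 2)
           if clients[i] & clients[j]]

# Keeping only rows with i < j leaves each pair exactly once.
unordered = [(i, j) for i, j in ordered if i < j]

print(len(ordered), len(unordered))  # 6 3
```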

Hi @glilienfield

Thank you for the update above. I agree with you, @bennu_neo's suggestion makes complete sense as you reiterated.

I have executed your refactored code @glilienfield and it works perfectly on the sample set. So thank you.

However, as soon as I try it on the smallest of the actual sets (~400k nodes of ~2.2mil... which results in ~70mil edges), I get the message that the server is taking too long to respond and it drops the connection. The server seemed to still be running (result from '$neo4j_home> bin\neo4j status'), but I could not connect to it. I killed the process after about an hour and no changes were made to the db.

@glilienfield -- to your question about updating the values - my dataset is static, so there will not be any nodes/relations added to the graph. From my perspective that means that between any two nodes there will always only be static values of comm_deg and deg_denom, which gives a static strength field.

However I am new to this, so if I am misunderstanding, happy to learn :slightly_smiling_face:

At this point I am not able to get this working on my side.

Since your query does nothing else besides create and update the MUTUAL relationship if it does not exist, you can simplify the query further by adding the ‘not exists((i)--(j))’ condition to the ‘where’ clause with an ‘and’. This will allow you to remove the ‘call’ subquery and just have the ‘create’ clause immediately following the ‘with’ clause. The only difference will be that you will not get those filtered nodes in your output. Do you really need to output every row when you are performing a bulk update?

Either way, can you now try wrapping the cypher in a ‘call’ with ‘in transactions’, or use a similar apoc procedure to partition the operation into multiple small transactions?

Hi @joestry !

Try with periodic commit. Maybe the following query could do the job.

CALL apoc.periodic.commit(
  "MATCH (i:`Cape Town`)-[]->(b)<-[]-(j:`Cape Town`)
   WHERE not exists((j)--(i))
   AND id(i) < id(j)
   WITH i, j, count(b) as comm_bh, CASE WHEN i.bh_degree > j.bh_degree THEN i.bh_degree ELSE j.bh_degree END as deg_denom
   LIMIT $limit
   CREATE (i)-[:MUTUAL {strength : (comm_bh*1.0/deg_denom) }]->(j)
   RETURN count(*)",
  {limit:1000});

B
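
For anyone unsure how the periodic commit loop behaves: apoc.periodic.commit re-runs the statement in separate transactions until it returns 0. The following is a plain-Python simulation of that loop, not the APOC implementation; the pair set, function names and limit are hypothetical.

```python
def run_until_zero(statement, limit):
    """Call statement(limit) repeatedly until it reports 0 rows."""
    batches = 0
    while statement(limit) > 0:
        batches += 1
    return batches


# Hypothetical work list: every unordered pair among 50 clients (1225 pairs).
pending = {(i, j) for i in range(50) for j in range(50) if i < j}
created = set()


def create_batch(limit):
    # Mirrors 'WHERE not exists((j)--(i)) ... LIMIT $limit':
    # only pairs without a relationship are picked up.
    batch = [p for p in sorted(pending) if p not in created][:limit]
    created.update(batch)
    return len(batch)  # plays the role of 'RETURN count(*)'


batches = run_until_zero(create_batch, limit=100)
print(batches, len(created))  # 13 1225
```

Without the 'not exists' filter the statement would keep finding the same pairs and never return 0, so the loop would not terminate.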

Hi @glilienfield

Thank you for the above. That would be great.

Would you mind giving a refactored example please...?

I am still new to Neo4j and still getting my head around how all these things hold together. Your explanation makes sense, however I would really appreciate your input into the actual implementation, as I have realized that my understanding of how apoc.periodic.iterate() works is not completely accurate. (Please see my code below for that implementation.)

I am really appreciative of your assistance and you guys are great for replying as quickly as you do!!!

The following refactor is what I was referring to. It filters out all paths that already have the relationship existing between the two nodes. This should be acceptable since you stated the calculated results would not change, so once the relationship exists with the calculations there is no need to recalculate the values.

match (i:`Cape Town`)-[]->(b)<-[]-(j:`Cape Town`)
where id(i) < id(j)
and not exists((j)--(i))
with i, j, count(b) as comm_bh,
case when i.bh_degree > j.bh_degree then i.bh_degree else j.bh_degree end as deg_denom    // larger of the two degrees, as in the original max(...)
create (i)-[m:MUTUAL]->(j) 
set m.strength=(comm_bh*1.0/deg_denom)
return i as start_node, comm_bh, deg_denom, j as end_node

In terms of the larger problem, it looks like you need to execute these queries in separate transactions. @bennu_neo already suggested using apoc.periodic.commit instead of apoc.periodic.iterate. Give that a try with the above query or the previous one. If that does not work, you can try plain Cypher with a 'call' using 'in transactions'. You need the ':auto' prefix if you are executing this in Neo4j Browser.

:auto match (i:`Cape Town`)    // the outer match supplies the rows that 'in transactions' batches on
call {
    with i
    match (i)-[]->(b)<-[]-(j:`Cape Town`)
    where id(i) < id(j)
    and not exists((j)--(i))
    with i, j, count(b) as comm_bh,
    case when i.bh_degree > j.bh_degree then i.bh_degree else j.bh_degree end as deg_denom    // larger of the two degrees
    create (i)-[m:MUTUAL]->(j)
    set m.strength=(comm_bh*1.0/deg_denom)
    return comm_bh, deg_denom, j as end_node
} in transactions
return i as start_node, comm_bh, deg_denom, end_node

Also, if you don't need the return values, since it is a batch process, you can remove the two return statements. The returned information would only be valuable for auditing purposes, or reviewing the results. If you want the return statements, I suggest you return a property of the start and end nodes, instead of the entire node. This will reduce the size of the return information. If you don't have a good property to return, you could return the node's id, such as id(start_node) and id(end_node).

Hi guys

Just to close the loop. The above did not solve my problem entirely; I needed to rejig my graph first. If I had done that, I think the above would have been a great solution.

My problem was actually that I had a bottleneck in my design: 2.2mil -> 230 -> 2.2mil.

The pairwise comparison on this became too big for my local PC.
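
A back-of-the-envelope sketch in plain Python of how quickly the pairwise comparison grows. The helper name is hypothetical, and the figures are only an upper bound on distinct client pairs, not the number of relationships actually created.

```python
from math import comb


def max_pairs(n_clients: int) -> int:
    """Upper bound on distinct unordered client pairs."""
    return comb(n_clients, 2)


# Client counts mentioned in the thread: one subgraph and the full graph.
for n in (400_000, 2_200_000):
    print(f"{n:>9,} clients -> up to {max_pairs(n):,} candidate pairs")
```

With only ~230 characteristics in the middle, many clients share at least one, so a large share of these candidate pairs can end up connected.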