How to improve query- Bloom Poor Connection error

andy_hegedus · February 14, 2023, 11:14pm

I have a parameterized Cypher query that works, but it seems to cause problems in Bloom, which reports poor connection errors.

Match (a:cpc)<-[:Classified_as{sequence:0}]-(b:patent)-[r:Cites{inherited:false}]->(c:patent)-[:Classified_as{sequence:0}]->(d:cpc)
where exists{(b)-[:Assigned_to]-(:company{name:$company})-[:Assigned_to]-(c)}
with a,d, count(r) as howmany
where howmany>$threshold
Call apoc.create.vRelationship(a,'Co_Uses',{num:howmany},d) yield rel
return a,d, rel

The query profile shows a total of 5239555 db hits, and I think this is chewing up memory.
For reference, the memory I have allocated is listed at the end of this post.

I think the issue is the order of the node queries with the node counts as follows.
Count(cpc)= 290,000
Count(patent) = 41,600
Count(company) = 2290

My intuition is to constrain first on company, then on patent, and finally on cpc. Thus far I have not been successful in doing that, hence the request for guidance. I have tried swapping the initial match clause with the where clause, but the cpc node variables, which I need for the virtual relationships, do not seem to be available in the with clause.

profile
Match (b:patent)-[:Assigned_to]-(:company{name:$company})-[:Assigned_to]-(c:patent)
Where exists{ (a:cpc)<-[:Classified_as{sequence:0}]-(b:patent)-[r:Cites{inherited:false}]->(c:patent)-[:Classified_as{sequence:0}]->(d:cpc)}
with a,d, count(r) as howmany
where howmany>$threshold
Call apoc.create.vRelationship(a,'Co_Uses',{num:howmany},d) yield rel
return a,d, rel

Variable a not defined (line 4, column 6 (offset: 237))
"with a,d, count(r) as howmany"

Suggestions?

Andy

FYI system memory
dbms.memory.heap.initial_size=15G
dbms.memory.heap.max_size=15G
dbms.memory.pagecache.size=16G

andy_hegedus · February 15, 2023, 1:18am

A bit more testing and my intuition may be off.
Some simplified queries

Option 1: the first match looks at all patents and their Cites connections.
Remember that count(patent) = 41,600 and count(r) = 214277.

profile
Match (b:patent)-[r:Cites{inherited:false}]->(c:patent)
where exists{(b)-[:Assigned_to]-(:company{name:'intel'})-[:Assigned_to]-(c)}
return b,c

This returns 5208495 total hits

Option 2 is to do the company match first, since it has the smallest node count.

profile
Match (b:patent)-[:Assigned_to]-(:company{name:'intel'})-[:Assigned_to]-(c:patent)
where exists{(b)-[r:Cites{inherited:false}]->(c)}
return b,c

This results in a 3X increase in db hits: 15583355 vs 5208495.

So I am not sure how to think about the optimization. Any guidance?

Andy

alison.cossette · February 15, 2023, 4:11pm

There are a couple of ways to understand the difference in db hits between these Cypher queries. Usually what I do is start by looking at the first MATCH clause in each to gain insight. Without having it up in front of me, I would offer the following:

Match (b:patent)-[r:Cites{inherited:false}]->(c:patent)

will traverse each patent as well as the relationship to the cited patent. So the number of node-edge-node values will be equal to the number of CITES relationships in your data. Whereas in

Match (b:patent)-[:Assigned_to]-(:company{name:'intel'})-[:Assigned_to]-(c:patent)

you are looking at a starting point that is the pairwise connection of all Intel patents, so the number of node-edge-node connections would be roughly the number of Intel patents squared.
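To put a rough number on that pairwise starting set, a query along these lines could help. It is only a sketch: it reuses the labels and relationship type from the queries above, and it assumes each patent has at most one Assigned_to relationship to the company. Because a single MATCH pattern will not reuse the same relationship twice, the pattern pairs each patent with every other patent rather than with itself, so the count is n * (n - 1), which is close to n squared.

```cypher
// Estimate how many (b, c) pairs the company-first pattern starts from.
// Assumption: one Assigned_to relationship per patent for this company.
MATCH (:company {name: 'intel'})-[:Assigned_to]-(p:patent)
WITH count(DISTINCT p) AS n
RETURN n AS patents, n * (n - 1) AS ordered_pairs
```

Comparing ordered_pairs with the number of Cites relationships should show which starting point hands the planner less work.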

Do you know how many [r:Cites] relationships you have as well as how many [:Assigned_to] relationships you have?

MATCH ()-[r:Cites|Assigned_to]->()
RETURN type(r) AS type, count(r) AS count
ORDER BY count DESC;

andy_hegedus · February 15, 2023, 5:18pm

Hi Alison,

I am doing proof-of-concept analysis and am usually not too sensitive to performance at this stage. However, it is impacting Bloom and causing poor connection errors, so I need to recast this query to continue my analysis.

Thank you for your comments. In light of them, the match clause in my original query is really bad from a performance perspective.

Match (a:cpc)<-[:Classified_as{sequence:0}]-(b:patent)-[r:Cites{inherited:false}]->(c:patent)-[:Classified_as{sequence:0}]->(d:cpc) return count(r)

returns a count of 204165 and if we square that number it is obscene.

My thoughts are that I should start with the company query for 'intel' which is 1 of 2290 followed by the patents assigned to 'intel'.

Match (:company{name:'intel'})-[:Assigned_to]-(a:patent) return count(a)

which yields 751 patents

If I then do a match to the citations

Match (c:company{name:'intel'})-[:Assigned_to]-(a:patent)-[r:Cites]->(b:patent)-[:Assigned_to]-(c) return count(r), count(distinct a), count(distinct b)

count(r)	count(distinct a)	count(distinct b)
910	290	264

I can then do the virtual relationship generation on this much smaller subset. Each patent may have many (10-20) (:cpc) nodes, but only one [:Classified_as{sequence:0}] relationship.

So, for my query, let's say I hard-code the $threshold variable to 3 for the time being. How would you recommend recasting the query to minimize the db hits?

Match (a:cpc)<-[:Classified_as{sequence:0}]-(b:patent)-[r:Cites{inherited:false}]->(c:patent)-[:Classified_as{sequence:0}]->(d:cpc)
where exists{(b)-[:Assigned_to]-(:company{name:$company})-[:Assigned_to]-(c)}
with a,d, count(r) as howmany
where howmany>3
Call apoc.create.vRelationship(a,'Co_Uses',{num:howmany},d) yield rel
return a,d, rel

What I have done is:

Match (target:company{name:$who})
with target
Match (a:cpc)-[:Classified_as{sequence:0}]-(b:patent)-[r:Cites{inherited:false}]->(c:patent)-[:Classified_as{sequence:0}]-(d:cpc)
where exists{(b)-[:Assigned_to]-(target)-[:Assigned_to]-(c)}
with a,d, count(r) as howmany
where howmany>3
Call apoc.create.vRelationship(a,'Co_Uses',{num:howmany},d) yield rel
return a,d, rel

This seems to be a bit more responsive, but if we follow the above logic, the pattern in the where exists clause should probably come first, with the cpc Classified_as pattern moved into the where exists clause instead. If I try that, the issue is that the [r:Cites] variable is not available for the count aggregation.

Is there a way to get around this?

Andy

alison.cossette · February 15, 2023, 6:33pm

Let me poke through this a bit:

One thing you can try is to reevaluate the data model itself. Queries will always be slower when you have to bring in a property. You could consider changing the relationship type to include the property.

For example,
"Classified_as{sequence:0}" to "Classified_as_0"
"Cites{inherited:false}" to "Cites_inherited_false"
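As a sketch of what that would look like, the original pattern could then be written with no property predicates at all. Classified_as_0 and Cites_inherited_false are hypothetical types here: this assumes relationships with those names have already been created alongside (or in place of) the existing ones.

```cypher
// Hypothetical relationship types: these do not exist in the current model
// and would have to be created from the existing relationships first.
MATCH (a:cpc)<-[:Classified_as_0]-(b:patent)-[r:Cites_inherited_false]->(c:patent)-[:Classified_as_0]->(d:cpc)
RETURN count(r)
```

Running this under PROFILE next to the property-based version would show whether the change is worth it for this data.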

By extracting properties you have many more types of relationships but it is worth evaluating the impact on hits. It is not uncommon for us to change the projection of the model in order to optimize for a given query. This flexibility is one of the strengths of a GraphDB.
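On the "Variable a not defined" error and the missing r: variables introduced inside an exists{} subquery are only visible inside it, so anything needed later (a, d, and r) has to be bound in a MATCH. One possible way to start from the company and still keep those variables is sketched below. It is untested against this data, it reuses the $company and $threshold parameters from the original query, and the planner is free to choose a different starting point, so it is worth checking with PROFILE.

```cypher
// Anchor on the company, then walk to its patents and their citations.
MATCH (target:company {name: $company})-[:Assigned_to]-(b:patent)
MATCH (b)-[r:Cites {inherited: false}]->(c:patent)
// Keep only citations whose cited patent belongs to the same company.
WHERE exists { (c)-[:Assigned_to]-(target) }
// Bind the primary cpc on each side so a and d are available to WITH.
MATCH (a:cpc)<-[:Classified_as {sequence: 0}]-(b),
      (c)-[:Classified_as {sequence: 0}]->(d:cpc)
WITH a, d, count(r) AS howmany
WHERE howmany > $threshold
CALL apoc.create.vRelationship(a, 'Co_Uses', {num: howmany}, d) YIELD rel
RETURN a, d, rel
```

Only the company membership check sits in exists{}, since nothing from it is needed afterwards.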
