Optimize cypher query that involves multiple match commands

Hi,

I am trying to optimize the running time of the following code:

Attempt 1:

CALL apoc.periodic.iterate(
	"MATCH  path=((a:alias) -- (c1:citation) -[p1]-> (t:BIOTERM) <-[p2]- (c2:citation) -- (b:alias))
	WHERE id(a) < id(b)  AND id(c1) <> id(c2)
	WITH a, b, c1.CITATION as CITATION1, c2.CITATION as CITATION2, p1, p2, t, 2 as precision
 	WITH a, b, CITATION1, CITATION2, p1, p2, t, 10^precision as factor
	WITH a, b, CITATION1, CITATION2, t, round(factor* (1/(2+p1.weight+p2.weight))) / factor as weight
	RETURN a, b, CITATION1, CITATION2, t, weight",
	"CREATE (a)-[e:through_topic]->(b)
	SET e.weight= weight 
	SET e.topic = t.MESH_TERM
	Set e.citation1 = CITATION1
	Set e.citation2 = CITATION2", {batchSize:2500})           
YIELD batches, total, errorMessages

Attempt 2:

CALL apoc.periodic.iterate(
	"MATCH  path=((c1:citation) -[p1]-> (t:BIOTERM) <-[p2]- (c2:citation))
	WHERE id(c1) < id(c2)  
	WITH c1.CITATION as CITATION1, c2.CITATION as CITATION2, p1, p2, t, 2 as precision
	MATCH (a:alias) --> (c1) 
 	WITH a, CITATION1, CITATION2, p1, p2, t, 10^precision as factor
 	MATCH (b:alias) --> (c2) 
        WHERE id(a) <> id(b)
	WITH a, b, CITATION1, CITATION2, t, round(factor* (1/(2+p1.weight+p2.weight))) / factor as weight
	RETURN a, b, CITATION1, CITATION2, t, weight",
	"CREATE (a)-[e:through_topic]->(b)
	SET e.weight= weight 
	SET e.topic = t.MESH_TERM
	Set e.citation1 = CITATION1
	Set e.citation2 = CITATION2", {batchSize:5000})           
YIELD batches, total, errorMessages

I feel I can do much better so the second and third MATCH statements could be parallelized. Is there a more elegant way?

Thanks,
Lavanya

Hi Lavanya,

Just trying to understand the context a bit better. Are you tring to find citations to bioterms and start to match aliases (which are indirectly linked through citation) based on citation's weight to the bioterm? Can there be more than one alias related to a citation or more than one citation to alias? (e.g. does (cit1)-(alias)-(cit2) or (a:alias)-(cit)-(b:alias) exist) if not it maybe easier to get a distinct list for each Bioterm <-[p]-(c:citation)-(a:alias) and process from that?

Not sure I fully grasp what you are doing, so if I am talking rubbish please ignore me, but I have a feeling this pattern (a:alias) -- (c1:citation) -[p1]-> (t:BIOTERM) <-[p2]- (c2:citation) -- (b:alias) is what you base your calculation off, but not really the pattern you are looking for which it sounds like is:
[(a:alias) -- (c:citation) -[p]-> (t:BIOTERM)]

as you may get a list of say 10 of these for a particular term and want to calculate each permutation

Hi @philiprichardjames,

I am trying to replace each of these paths: (a:alias) -- (c1:citation) -[p1]-> (t:BIOTERM) <-[p2]- (c2:citation) -- (b:alias) where id(a) < id(b) and id(c1) <> id(c2) with an edge from (a:alias) to (b:alias) as I do in Attempt 1. I am trying to optimize Attempt 1 since there are a lot of such paths.

My Attempt 2 is slightly off as you rightly catched (there need not be a distinct alias connected to a citation). A right approach would be trying to get all paths of the form ((c1:citation) -[p1]-> (t:BIOTERM) <-[p2]- (c2:citation)) and then get all (a:alias) -- (c1:citation) and (c2:citation) -- (b:alias) and connect all combinations of a and b - can this be parallelized? Maybe I can add an auxilliary edge between (c1:citation) and (c2:citation) and work from there?

I like your suggestions of first getting all [(a:alias) -- (c:citation) -[p]-> (t:BIOTERM)] or rather just [(c:citation) -[p]-> (t:BIOTERM)] and then looking for all combinations of citations connected to a same bioterm and then connecting all pairs of aliases connected to each combination of citations. Can this be done within a query or is it best to export the [(c:citation) -[p]-> (t:BIOTERM)] and [(c:citation) -- (a:alias)] relations as csv files and to do it outside cypher?

There are lot of possible approaches and I want to know which one would be the best.

Thanks,
Lavanya

Update: Finally figured out a solution using "OPTIONAL MATCH":

CALL apoc.periodic.iterate(
	"MATCH  ((c1:citation) -[p1]-> (t:BIOTERM) <-[p2]- (c2:citation))
	WHERE id(c1) < id(c2)
	WITH c1, c2, p1.weight as w1, p2.weight as w2, t, 2 as precision
	WITH c1, c2, w1, w2, t, 10^precision as factor
	WITH c1, c2, t, round(factor* (1/(2+w1+w2))) / factor as weight
	OPTIONAL MATCH (a) --> (c1) 
	OPTIONAL MATCH (b) --> (c2) 
	RETURN a, b, weight, t
	",
	"
	CREATE (a)-[e:through_topic]->(b)
	SET e.weight= weight, 
	 e.topic = t.MESH_TERM,
	 e.citation1 = CITATION1
	 e.citation2 = CITATION2", {batchSize:5000})          
YIELD batches, total, errorMessages

Still need to try it for my larger graph instance - but I am sure this will be faster.

If the relationship name uniquely identifies the next term is BIOTERM etc then using relationship names would be optimal than using labels.

For ex

MATCH ((c1:citation) -[p1:HAS_BIO_TERM]-> (t) <-[p2:HAS_BIO_TERM]- (c2))
....
MATCH (a) -[:HAS_ALIAS]-> (c1) 

can be faster.

Whenever a label is specified during traversal Cypher has to validate the node it reached has the specified label or not.