I have the following query:
MATCH (drug)-[r1:INHIBITS|DISRUPTS|INTERACTS_WITH]->(c2)-[r2:CAUSES|ASSOCIATED_WITH|PREDISPOSES|AUGMENTS|AFFECTS]->(covid:Concept)
USING INDEX covid:Concept(cui)
where drug.cui in ["C0031853"] and covid.cui in ["C5203670","C5203676","C5203671"]
WITH drug,r1,c2,r2,covid order by (r1.freq+r2.freq) desc limit 20
CALL {
WITH drug,r1,c2
match (c_xy)<-[r_in_xy:IS_IN]-(s_xy)<-[r_extr_xy:Extracted_From]-(i_xy:Instance {predicate: type(r1)})-[r_sub_x:Inst_Subject]->(drug),
(i_xy)-[r_obj_y:Inst_Object]->(c2)
WITH c_xy, s_xy
LIMIT 3
RETURN collect("PMID:"+c_xy.pmid+"-"+c_xy.pyear+": "+s_xy.sentence) as sent_xy
}
CALL {
WITH c2,r2,covid
match (c_yz)<-[r_in_yz:IS_IN]-(s_yz)<-[r_extr_yz:Extracted_From]-(i_yz:Instance{predicate:type(r2)})-[r_sub_y:Inst_Subject]->(c2),
(i_yz)-[r_obj_z:Inst_Object]->(covid)
WITH c_yz, s_yz
LIMIT 3
RETURN collect("PMID:"+c_yz.pmid+"-"+c_yz.pyear+": "+s_yz.sentence) as sent_yz
}
RETURN sent_xy,drug.cui+":"+drug.name+"-"+type(r1)+"-> "+c2.cui+":"+c2.name+"-"+type(r2)+"-> "+covid.cui+":"+covid.name as Drug_Y_Covid,sent_yz;
The query plan is as follows:
My question is how can I optimize this query. I think the two main issues are:
MATCH (drug)-[r1:INHIBITS|DISRUPTS|INTERACTS_WITH]->(c2)-[r2:CAUSES|ASSOCIATED_WITH|PREDISPOSES|AUGMENTS|AFFECTS]->(covid:Concept)
USING INDEX covid:Concept(cui)
where drug.cui in ["C0031853"] and covid.cui in ["C5203670","C5203676","C5203671"]
WITH drug,r1,c2,r2,covid order by (r1.freq+r2.freq) desc limit 20
As you can notice I have already used a hint index to start from covid instead of the drug. Are there any further improvements I could make? As of now, the query gets relatively fast from covid to c2, but then it needs to expand on all the relationships of c2 and that is quite slow and then also filter the drug by its cui id. In this case, index doesn't play any role.
The second issue is the filtering of the instance predicate. There are a lot of instance nodes, so this is not really fast. I have tried adding an index on the property, but the query planner doesn't use it. My question is how could I go about optimizing my query, and secondly maybe refactoring my graph for best performance. I was thinking of using labels instead of properties for predicate, but then again filtering by dynamic node label is not as effective.
match (c_xy)<-[r_in_xy:IS_IN]-(s_xy)<-[r_extr_xy:Extracted_From]-(i_xy:Instance {predicate: type(r1)})-[r_sub_x:Inst_Subject]->(drug)
Any help appreciated