I have the following fairly large query:
MATCH (eng:Engine {Name: $engine_name, Version: $engine_version})-[:Based_on]->(sembler:Disassembler) USING INDEX SEEK eng:Engine(Name, Version)
MATCH (disas_1:Disassembly {SHA256: $file1_SHA256})-[:Disassembled_by]->(sembler) USING INDEX SEEK disas_1:Disassembly(SHA256)
MATCH (disas_2:Disassembly {SHA256: $file2_SHA256})-[:Disassembled_by]->(sembler) USING INDEX SEEK disas_2:Disassembly(SHA256)
CALL apoc.cypher.mapParallel("
MATCH (func_1:Function {MD5: _.function1_MD5})-[:Exists_in]->(disas_1)
MATCH (func_2:Function {MD5: _.function2_MD5})-[:Exists_in]->(disas_2)
WITH func_1, func_2, _, eng
CALL apoc.do.case(
[
ID(func_1) = ID(func_2),
'RETURN 1',
_.similarity > eng.Function_similarity_limit,
'MERGE (func_1)-[func_comp:Function_compare {Engine: engine, Engine_version: engine_verison}]-(func_2)
ON CREATE SET func_comp += {Similarity: similarity, Description: description}'
],
'',
{func_1: func_1, func_2: func_2, similarity: _.similarity, description: _.description, engine: eng.Name, engine_verison: eng.Version})
YIELD value
RETURN 'success' as code",
{eng: eng, disas_1: disas_1, disas_2: disas_2},
$relation_data)
YIELD value
RETURN 'success' as code
A high-level explanation:
I disassemble files and save their functions to a graph.
Each function node is connected to a Disassembly node, which represents the file.
I then run some function-comparison tools and get a large list, $relation_data, that I need to iterate over. Each entry contains one function from each of the two files I compared, along with their similarity score.
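To make the shape of $relation_data concrete, here is a minimal Python sketch of how one entry might be built on the client side. The field names (function1_MD5, function2_MD5, similarity, description) are the ones the query reads from `_`; the helper name `make_relation` and all sample values are hypothetical placeholders, not real data.

```python
from typing import List, TypedDict


class Relation(TypedDict):
    """One entry of $relation_data, as read by the query via `_`."""
    function1_MD5: str
    function2_MD5: str
    similarity: float
    description: str


def make_relation(md5_a: str, md5_b: str, similarity: float,
                  description: str = "") -> Relation:
    # Hypothetical helper: packs one comparison result into the
    # dictionary layout the query expects.
    return {
        "function1_MD5": md5_a,
        "function2_MD5": md5_b,
        "similarity": similarity,
        "description": description,
    }


# Placeholder values for illustration only.
relation_data: List[Relation] = [
    make_relation("a" * 32, "b" * 32, 0.93, "placeholder match"),
    make_relation("c" * 32, "c" * 32, 1.0, "same function on both sides"),
]
```

A list like this would then be passed as the relation_data parameter when running the query.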
I used apoc.do.case to handle two edge cases:
First, if both entries refer to the same function, don't create a relationship (I don't need a self-loop).
Second, if the similarity is too low, then don't write it to the db at all.
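The two checks above can be sketched in plain Python, for example to see how many entries would actually result in a write. This is only an illustration of the logic, not the query itself. It assumes that equal MD5 values mean the same Function node (the query compares node IDs instead), and the names `keep_relation` and `similarity_limit` are hypothetical, with `similarity_limit` standing in for eng.Function_similarity_limit.

```python
def keep_relation(entry: dict, similarity_limit: float) -> bool:
    """Return True if the entry would lead to a Function_compare relationship."""
    # Edge case 1: same function on both sides, so no self-loop is wanted.
    # Assumption: identical MD5 values refer to the same Function node.
    if entry["function1_MD5"] == entry["function2_MD5"]:
        return False
    # Edge case 2: only keep entries strictly above the engine's limit,
    # mirroring `_.similarity > eng.Function_similarity_limit`.
    return entry["similarity"] > similarity_limit


# Placeholder entries for illustration only.
sample = [
    {"function1_MD5": "a" * 32, "function2_MD5": "b" * 32, "similarity": 0.9},
    {"function1_MD5": "c" * 32, "function2_MD5": "c" * 32, "similarity": 1.0},
    {"function1_MD5": "d" * 32, "function2_MD5": "e" * 32, "similarity": 0.2},
]

kept = [entry for entry in sample if keep_relation(entry, similarity_limit=0.5)]
```

With these placeholder values, only the first entry passes both checks.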
I tried using apoc.cypher.mapParallel so I could parallelize the work, and it does run faster than UNWIND, but not by much.
I did create an index on the MD5 property of the Function nodes. For some reason, when I give an index hint inside apoc.do.case, it tells me it can't resolve the hint (not sure if that is relevant).
Is the query built badly?
The database has almost 1 million Functions at the moment and 10 million relationships.
I am using Neo4j 4.1 Community Edition with py2neo.
Now most of the time this merge query takes less than a second.
But sometimes the time it takes to execute spikes to 40-60 seconds.