Performance drop in query when trying to run on ~4000 items with apoc.do.case

atesterguy · September 13, 2021, 4:10pm

I have the following fairly large query:

MATCH (eng:Engine {Name: $engine_name, Version: $engine_version})-[:Based_on]->(sembler:Disassembler) USING INDEX SEEK eng:Engine(Name, Version)

MATCH (disas_1:Disassembly {SHA256: $file1_SHA256})-[:Disassembled_by]->(sembler) USING INDEX SEEK disas_1:Disassembly(SHA256)

MATCH (disas_2:Disassembly {SHA256: $file2_SHA256})-[:Disassembled_by]->(sembler) USING INDEX SEEK disas_2:Disassembly(SHA256)

        CALL apoc.cypher.mapParallel("
            MATCH (func_1:Function {MD5: _.function1_MD5})-[:Exists_in]->(disas_1)
            MATCH (func_2:Function {MD5: _.function2_MD5})-[:Exists_in]->(disas_2)
            WITH func_1, func_2, _, eng
            CALL apoc.do.case(
                [
                ID(func_1) = ID(func_2),
                'RETURN 1',
                _.similarity > eng.Function_similarity_limit,
                'MERGE (func_1)-[func_comp:Function_compare {Engine: engine, Engine_version: engine_verison}]-(func_2) 
                ON CREATE SET func_comp += {Similarity: similarity, Description: description}'
                ],
                '',
                {func_1: func_1, func_2: func_2, similarity: _.similarity, description: _.description, engine: eng.Name, engine_verison: eng.Version})
            YIELD value
            RETURN 'success' as code",
            {eng: eng, disas_1: disas_1, disas_2: disas_2}, 
            $relation_data) 
        YIELD value 
        RETURN 'success' as code

Some high-level explanation:
I disassemble files and save their functions to a graph.

Each function node is connected to a disas node which represents the file.

I then run some function comparison tools and get a large list, $relation_data, that I need to iterate over. Each entry contains one function from each of the two files I compared, plus their similarity score.

I used apoc.do.case to handle 2 edge cases:
First, if the function is the same in both entries, don't create a relationship (I don't need a self-loop).
Second, if the similarity is too low, don't write it to the db at all.
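
As a rough illustration of the two edge cases above, the same checks could in principle be applied on the client before the list is sent to the database. This is only a sketch under assumptions: the helper name `filter_relations` is hypothetical, `similarity_limit` is assumed to have been read from `eng.Function_similarity_limit` beforehand, and comparing MD5 values only matches the `ID(func_1) = ID(func_2)` check if a Function node is unique per MD5.

```python
from typing import Any, Dict, List


def filter_relations(
    relation_data: List[Dict[str, Any]], similarity_limit: float
) -> List[Dict[str, Any]]:
    """Drop entries that the query would skip anyway (hypothetical helper)."""
    kept = []
    for row in relation_data:
        # Edge case 1: same function on both sides, so no self-loop is wanted.
        # Assumption: equal MD5 values refer to the same Function node.
        if row["function1_MD5"] == row["function2_MD5"]:
            continue
        # Edge case 2: the query only merges when similarity > limit.
        if row["similarity"] <= similarity_limit:
            continue
        kept.append(row)
    return kept
```

With the list filtered this way, the conditional branch inside the query might no longer be needed, although that would have to be checked against the real data.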

I tried implementing mapParallel so I could parallelize the work, and it does run faster than UNWIND, but not by much.

I did create an index for the Function nodes on the MD5 property. For some reason, when I give the index hint inside apoc.do.case, it tells me it can't resolve the hint (not sure if it is relevant).

Is the query built badly?

The database has almost 1 million Functions at the moment and 10 million relationships.

I am using neo4j 4.1 community edition with py2neo.

Now most of the time this merge query takes less than a second.
But sometimes the time it takes to execute spikes to 40-60 seconds.

atesterguy · September 14, 2021, 3:06pm

I have simplified my query to try to avoid apoc.do.case.

I thought it might be a bottleneck as well.

        MATCH (eng:Engine {Name: $engine_name, Version: $engine_version})-[:Based_on]->(sembler:Disassembler) USING INDEX SEEK eng:Engine(Name, Version)
        MATCH (disas_1:Disassembly {SHA256: $file1_SHA256})-[:Disassembled_by]->(sembler) USING INDEX SEEK disas_1:Disassembly(SHA256)
        MATCH (disas_2:Disassembly {SHA256: $file2_SHA256})-[:Disassembled_by]->(sembler) USING INDEX SEEK disas_2:Disassembly(SHA256)

        CALL apoc.cypher.mapParallel('
            MATCH (func_1:Function)-[:Exists_in]->(disas_1) WHERE func_1.MD5 = _.function1_MD5
            MATCH (func_2:Function)-[:Exists_in]->(disas_2) WHERE func_2.MD5 = _.function2_MD5

            MERGE (func_1)-[func_comp:Function_compare {Engine: eng.Name, Engine_version: eng.Version}]-(func_2)
                ON CREATE SET func_comp += {Similarity: _.similarity, Description: _.description}

            ', {eng: eng, disas_1: disas_1, disas_2: disas_2}, $relation_data)

        YIELD value
        RETURN 'success' as code

But this query returns:

Failed to invoke procedure `apoc.cypher.mapParallel`: Caused by: org.neo4j.graphdb.security.AuthorizationViolationException: Create relationship with type 'Function_compare' is not allowed for user 'neo4j' with FULL overridden by READ overridden by READ.

Any clue as to why this happens?
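
The error text ("FULL overridden by READ") suggests the inner statement was executed with read-only permissions. One possible explanation, not confirmed in this thread, is that apoc.cypher.mapParallel itself runs in read mode, so a MERGE inside it would be rejected. A sketch of an alternative that avoids the procedure is a plain UNWIND write query sent in batches. The names `UNWIND_MERGE_QUERY` and `chunked` are hypothetical, and whether this is faster on the real data is untested.

```python
from typing import Any, Dict, Iterator, List

# Plain write query without mapParallel; parameter names follow the original query.
UNWIND_MERGE_QUERY = """
MATCH (eng:Engine {Name: $engine_name, Version: $engine_version})-[:Based_on]->(sembler:Disassembler)
MATCH (disas_1:Disassembly {SHA256: $file1_SHA256})-[:Disassembled_by]->(sembler)
MATCH (disas_2:Disassembly {SHA256: $file2_SHA256})-[:Disassembled_by]->(sembler)
UNWIND $relation_data AS row
MATCH (func_1:Function {MD5: row.function1_MD5})-[:Exists_in]->(disas_1)
MATCH (func_2:Function {MD5: row.function2_MD5})-[:Exists_in]->(disas_2)
MERGE (func_1)-[func_comp:Function_compare {Engine: eng.Name, Engine_version: eng.Version}]-(func_2)
  ON CREATE SET func_comp += {Similarity: row.similarity, Description: row.description}
RETURN count(func_comp) AS merged
"""


def chunked(rows: List[Dict[str, Any]], size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of at most `size` rows (hypothetical helper)."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]
```

Each batch from `chunked(relation_data, 1000)` could then be passed to py2neo's `Graph.run` with `relation_data=batch` alongside the other query parameters; the batch size of 1000 is an arbitrary starting point, not a measured optimum.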
