Scalability of large Merkle Tree diffing in Neo4j

Wenzel · August 28, 2024, 4:23am

Hello Neo4j community and Cypher experts,

I've successfully implemented a proof of concept with Neo4j, but I'm facing challenges as I try to scale my solution. I'm looking for guidance on optimizing operations involving large Merkle Trees, particularly the diff process.

Current implementation: client-side parallelization and diff

Given 2 Tree hashes ($base, $diffee), I gather all the children and their relationships:

MATCH (t:Tree)-[r:HAS_CHILD_BLOB|HAS_CHILD_TREE]->(c)
WHERE t.hash IN [$base, $diffee]
RETURN t.hash as parent_hash, type(r) as type, r.name as name, c.hash as child_hash

Compute the difference (NEW,MOD,DEL)
For each MOD child subtree detected, call the diff recursively
For each NEW/DEL child subtree, query all the subblobs at any length, and do that in parallel with Promise.all()

While functional, this approach takes about 45 seconds for a full diff, potentially extending to 10 minutes for deeply nested Merkle trees. I'm also encountering thread starvation errors due to numerous parallel queries.

ERROR Increase in thread starvation events detected (2 events over a period of 600000 ms) - This may indicate an overload condition

Because I'm running so many parallel queries at the same time to try to utilize all the cores.

Things i've tried

Moving the diff on Neo4j side.

I tried implementing the entire diff in Cypher. It performs well for trees with a reasonable number of children (~30ms), but struggles with larger trees (e.g., 13,000 children), taking over 8 minutes to respond.

Started streaming 26818 records after 1 ms and completed after 510546 ms, displaying first 1000 rows.

Moving the parallelization on Neo4j side

Another thing I tried was to push the parallelization of the queries inside Neo4j, leveraging apoc.cypher.mapParallel.

The goal was to perform less queries from the driver side, reducing the thread starvation:

CALL apoc.cypher.mapParallel(
    'MATCH (t:Tree)-[r:HAS_CHILD_BLOB|HAS_CHILD_TREE]->(c)
    WHERE t.hash IN [_.base, _.diffee]
    WITH _, t.hash as parent_hash, collect({type: type(r), name: r.name, hash: c.hash}) as children
    RETURN collect({parent_hash: parent_hash, children: children}) as result, _.base as base_hash, _.diffee as diffee_hash, _.path as base_path',
    {},
    $hash_list
) YIELD value
RETURN value

However this turn out to be much slower than my current implementation, and remains unconclusive at the moment, possibly due to the overhead of parallelization for short-lived queries ?

Questions

Ideally, I would like to implement the diff in a single Cypher querying traversing the whole MerkleTree at once, instead of performing thousands of little subqueries for each individual Tree.

However, I couldn't express what I wanted with Cypher constructs.

Should I create my own User defined Java procedure to do this or is there a way to leverage Cypher without investing costly development and maintenance time ?

Additionaly, as I'm running Neo4j via the Docker image, are there any hardcoded memory settings that would prevent Neo4j from using the whole available host memory ?

Of course I need to add that I've inserted unique constraints on the hash properties for faster lookup.

I appreciate any insight you may have, and suggestions !

Context:

Neo4j Docker v5.23
Python and Javascript drivers + GraphQL library

myron_higerd · September 4, 2024, 5:57pm

Not sure I understand your requirements completely but it appears you are trying to determine what changed so then the following should help. This assumes that you are looking for changes in the children that are captured as part of the diffee tree. If the child didn't change then both trees will contain the same child name and hash.

The below cypher has not been tested but is a pattern I have used frequently.

MATCH (t:Tree)-[r:HAS_CHILD_BLOB|HAS_CHILD_TREE]->(c)
WHERE t.hash IN [_.base, _.diffee]
WITH t.hash as parent_hash, {type: type(r), name: r.name, hash: c.hash} as child
// Determine number of parents for each child
WITH _, child, collect(parent_hash) as parent_hashes
// Keep only records where there is only one parent, base or diffee
WHERE size(parent_hashes) = 1
// If a base parent tree is found then the child is the old version and the action is DEL
// If a diffee parent tree is found then the child is the new version and the action is NEW
// An update is a combination of a DEL and a NEW
RETURN _, CASE WHEN parent_hashes[0] = _.base THEN 'DEL' ELSE 'NEW' END AS action, child, parent_hash

Topic		Replies	Views
Versioning and diffing large hierarchical data with Neo4j Cypher performance , cypher	1	572	June 30, 2021
Dynamically create variable-length pattern where each relationship get it's property from a list Cypher performance	1	404	June 17, 2020
Creating 200K relationships to a node is taking a lot of time in Neo4J 3.5? Neo4j Graph Platform	18	7295	November 18, 2021
apoc.refactor.mergeNodes Performance Neo4j Graph Platform apoc , performance , migrated , mergenode	6	343	November 28, 2022
Speeding up apoc.refactor.mergeNodes query Cypher apoc , performance , cypher , relationship	1	224	April 28, 2023

August Summer Fun!