Hello Neo4j community and Cypher experts,
I've successfully implemented a proof of concept with Neo4j, but I'm facing challenges as I try to scale my solution. I'm looking for guidance on optimizing operations involving large Merkle Trees, particularly the diff process.
Current implementation: client-side parallelization and diff
- Given 2 Tree hashes (
$base, $diffee
), I gather all the children and their relationships:
MATCH (t:Tree)-[r:HAS_CHILD_BLOB|HAS_CHILD_TREE]->(c)
WHERE t.hash IN [$base, $diffee]
RETURN t.hash as parent_hash, type(r) as type, r.name as name, c.hash as child_hash
- Compute the difference (
NEW,MOD,DEL
) - For each
MOD
child subtree detected, call the diff recursively - For each
NEW/DEL
child subtree, query all the subblobs at any length, and do that in parallel withPromise.all()
While functional, this approach takes about 45 seconds for a full diff, potentially extending to 10 minutes for deeply nested Merkle trees. I'm also encountering thread starvation errors due to numerous parallel queries.
ERROR Increase in thread starvation events detected (2 events over a period of 600000 ms) - This may indicate an overload condition
Because I'm running so many parallel queries at the same time to try to utilize all the cores.
Things i've tried
Moving the diff on Neo4j side.
I tried implementing the entire diff in Cypher. It performs well for trees with a reasonable number of children (~30ms
), but struggles with larger trees (e.g., 13,000 children
), taking over 8 minutes to respond.
Started streaming 26818 records after 1 ms and completed after 510546 ms, displaying first 1000 rows.
Moving the parallelization on Neo4j side
Another thing I tried was to push the parallelization of the queries inside Neo4j, leveraging apoc.cypher.mapParallel
.
The goal was to perform less queries from the driver side, reducing the thread starvation:
CALL apoc.cypher.mapParallel(
'MATCH (t:Tree)-[r:HAS_CHILD_BLOB|HAS_CHILD_TREE]->(c)
WHERE t.hash IN [_.base, _.diffee]
WITH _, t.hash as parent_hash, collect({type: type(r), name: r.name, hash: c.hash}) as children
RETURN collect({parent_hash: parent_hash, children: children}) as result, _.base as base_hash, _.diffee as diffee_hash, _.path as base_path',
{},
$hash_list
) YIELD value
RETURN value
However this turn out to be much slower than my current implementation, and remains unconclusive at the moment, possibly due to the overhead of parallelization for short-lived queries ?
Questions
Ideally, I would like to implement the diff in a single Cypher querying traversing the whole MerkleTree at once, instead of performing thousands of little subqueries for each individual Tree.
However, I couldn't express what I wanted with Cypher constructs.
Should I create my own User defined Java procedure to do this or is there a way to leverage Cypher without investing costly development and maintenance time ?
Additionaly, as I'm running Neo4j via the Docker image, are there any hardcoded memory settings that would prevent Neo4j from using the whole available host memory ?
Of course I need to add that I've inserted unique constraints on the hash
properties for faster lookup.
I appreciate any insight you may have, and suggestions !
Context:
- Neo4j Docker v5.23
- Python and Javascript drivers + GraphQL library