Versioning and diffing large hierarchical data with Neo4j

Wenzel · April 1, 2021, 10:06pm

Hi,

I'm interested by using Neo4j as a version control database for a large hierarchical tree (which is a filesystem like a source code tree, or bigger).

My pet project is neogit, an implementation of a git client, backed by Neo4J.

Git uses a Merkle-Tree like structure to store its data

git

I have implemented the same model:

And imported my data (example on neo4j source code repository)

And I made a little tet, modifying a specific file and recapturing the while tree.

As you can see in the following screenshot, PublicApiDoclet.java has been modified, and a new set of Trees has been created for the new commit, but they have relationship with the old Trees as well, since the rest of the files are unchanged.

Colors:
Branch
Grey: Commit
Tree
Blob

the commit -[:OWNS_FILESYSTEM]-(root_tree:Tree)
and the each Tree -[:HAS_CHILD_TREE | :HAS_CHILD_BLOB]->(:Tree | :Blob)
with each relationship having a name property corresponding to the child name (file or directory)

The main issue I'm facing now is how to query the difference between 2 commits, and highlight
the new state of files:

created
deleted
modified

Diffing Cypher implementation

My current algorithm to query the differences is the following:

1 Get the list of modified Tree in the new commit
2 For each modified Tree

get the full path
check if there is an existing Tree at this path in the old commit
compare the old Tree and the new Tree respective children for differences

In the following query example, I'm trying to find the modification between 2 releases
of Windows 10.

1 win10-20H1-xxxx
2 win10-20h2-xxxx

Get the list of modified Trees

Slow: Checking all the Trees originating from the new commit, but not having the old commit as ancestor.

MATCH (new_os:OS {name: "win10-20H2-2009.19042.631"})-[:OWNS_FILESYSTEM]->(new_root:Tree)
WITH new_root
MATCH (old_os {name: "win10-20H1-2004.19041.264"})-[:OWNS_FILESYSTEM]->(old_root:Tree)
WITH old_root,new_root
MATCH (new_root)-[:HAS_CHILD_TREE*]->(new_child_t:Tree)
WHERE NOT (old_root)-[*]->(new_child_t)
RETURN count(new_child_t)

This method is unbearably slow, so I reworked the query to filter using the calculated sha1sum.

Fast : sha1sums filter

MATCH (new_os:OS {name: "win10-20H2-2009.19042.631"})-[:OWNS_FILESYSTEM]->(new_root:Tree)
WITH new_root
MATCH (old_os {name: "win10-20H1-2004.19041.264"})-[:OWNS_FILESYSTEM]->(old_root:Tree)
WITH old_root,new_root
// collect old SHA1 trees
MATCH (old_root)-[:HAS_CHILD_TREE*]->(old_child_tree:Tree)
WITH collect(old_child_tree.sha1sum) as old_sha1_list
// filter new trees
MATCH new_paths = (new_root)-[:HAS_CHILD_TREE*]->(new_child_tree:Tree)
WHERE NOT new_child_tree.sha1sum IN old_sha1_list
RETURN new_child_tree.sha1sum, new_paths

Query for an equivalent Tree at the old_commit

The goal now is to get the filesystem path from each new_paths collected
before, and build a new query that would retrieve the old Tree, if it exists.

This has to be done by a script outside of cypher as there is not way to build a variable-length pattern dynamically (AFAIK)

    tree_path = Path("/Windows/System32/cmd.exe")
    tree_match_partial = ["MATCH path = (old_root_fs:Tree {sha1sum: $old_root_sha1})"]
    for path_part_index, path_part in enumerate(tree_path.parts[1:]):
        # {
        #     'path_part_0': 'Windows',
        #     'path_part_1': 'System32',
        # }
        param_var_name = f'path_part_{path_part_index}'
        params[param_var_name] = cypher_escape(path_part)
        rel = f'-[{{name: ${param_var_name}}}]->(:Tree)'
        # -[name: "Windows"]->(:Tree) ......
        tree_match_partial.append(rel)

Get the children

This is trivial, once we have both Trees to be compared:

    MATCH (t:Tree {sha1sum: $sha1})-[r:HAS_CHILD_TREE|HAS_CHILD_BLOB]->(c)
    RETURN r.name as filename, c.sha1sum as sha1sum

Performance issues.

If we sum up the algorithm, it takes a lot of queries to achieve a full diff of this tree structure:

1 query to get the modified Trees
n queries to get the related Trees
n queries to get their respective children

Unfortunately this obliterates the performance of my git diff command.

The goal would be to do the full diff in a single Cypher query.

I would like your suggestions, both on the data model and the Cypher language to improve this situtation.

I'm ccing @lju , since her talk: "How to keep track of change - versioning approaches in Neo4j" is clearly on the topic here :)

neo4j version: 4.2.4, Ubuntu 20.04, Docker
Official Python3 driver

Thank you very much for your help !

lju · June 30, 2021, 9:49am

Hi @Wenzel

Completely missed this, so sorry!

Will have a think about this - do you have a repo with the data you loaded into the graph?

Topic		Replies	Views
Scalability of large Merkle Tree diffing in Neo4j Neo4j Graph Platform performance , cypher	1	59	September 4, 2024
Dynamically create variable-length pattern where each relationship get it's property from a list Cypher performance	1	404	June 17, 2020
Querying Linux Kernel Git history with neo4j Neo4j Graph Platform migrated	14	188	August 8, 2022
How to compare 2 different versions (basically JSONs) from neo4j Neo4j Graph Platform cypher , compare , knowledge-base , json	1	798	May 16, 2022
Dimitris from Oxford, UK Introduce-Yourself	0	185	October 5, 2020

August Summer Fun!