Versioning and diffing large hierarchical data with Neo4j

Hi,

I'm interested by using Neo4j as a version control database for a large hierarchical tree (which is a filesystem like a source code tree, or bigger).

My pet project is neogit, an implementation of a git client, backed by Neo4J.

Git uses a Merkle-Tree like structure to store its data

git

I have implemented the same model:

And imported my data (example on neo4j source code repository)

And I made a little tet, modifying a specific file and recapturing the while tree.

As you can see in the following screenshot, PublicApiDoclet.java has been modified, and a new set of Trees has been created for the new commit, but they have relationship with the old Trees as well, since the rest of the files are unchanged.

Colors:
:green_circle: Branch
Grey: Commit
:yellow_circle: Tree
:red_circle: Blob

the commit -[:OWNS_FILESYSTEM]-(root_tree:Tree)
and the each Tree -[:HAS_CHILD_TREE | :HAS_CHILD_BLOB]->(:Tree | :Blob)
with each relationship having a name property corresponding to the child name (file or directory)

The main issue I'm facing now is how to query the difference between 2 commits, and highlight
the new state of files:

  • created
  • deleted
  • modified

Diffing Cypher implementation

My current algorithm to query the differences is the following:

1 Get the list of modified Tree in the new commit
2 For each modified Tree

  • get the full path
  • check if there is an existing Tree at this path in the old commit
  • compare the old Tree and the new Tree respective children for differences

In the following query example, I'm trying to find the modification between 2 releases
of Windows 10.

1 win10-20H1-xxxx
2 win10-20h2-xxxx

Get the list of modified Trees

Slow: Checking all the Trees originating from the new commit, but not having the old commit as ancestor.

MATCH (new_os:OS {name: "win10-20H2-2009.19042.631"})-[:OWNS_FILESYSTEM]->(new_root:Tree)
WITH new_root
MATCH (old_os {name: "win10-20H1-2004.19041.264"})-[:OWNS_FILESYSTEM]->(old_root:Tree)
WITH old_root,new_root
MATCH (new_root)-[:HAS_CHILD_TREE*]->(new_child_t:Tree)
WHERE NOT (old_root)-[*]->(new_child_t)
RETURN count(new_child_t)

This method is unbearably slow, so I reworked the query to filter using the calculated sha1sum.

Fast : sha1sums filter

MATCH (new_os:OS {name: "win10-20H2-2009.19042.631"})-[:OWNS_FILESYSTEM]->(new_root:Tree)
WITH new_root
MATCH (old_os {name: "win10-20H1-2004.19041.264"})-[:OWNS_FILESYSTEM]->(old_root:Tree)
WITH old_root,new_root
// collect old SHA1 trees
MATCH (old_root)-[:HAS_CHILD_TREE*]->(old_child_tree:Tree)
WITH collect(old_child_tree.sha1sum) as old_sha1_list
// filter new trees
MATCH new_paths = (new_root)-[:HAS_CHILD_TREE*]->(new_child_tree:Tree)
WHERE NOT new_child_tree.sha1sum IN old_sha1_list
RETURN new_child_tree.sha1sum, new_paths

Query for an equivalent Tree at the old_commit

The goal now is to get the filesystem path from each new_paths collected
before, and build a new query that would retrieve the old Tree, if it exists.

This has to be done by a script outside of cypher as there is not way to build a variable-length pattern dynamically (AFAIK)

    tree_path = Path("/Windows/System32/cmd.exe")
    tree_match_partial = ["MATCH path = (old_root_fs:Tree {sha1sum: $old_root_sha1})"]
    for path_part_index, path_part in enumerate(tree_path.parts[1:]):
        # {
        #     'path_part_0': 'Windows',
        #     'path_part_1': 'System32',
        # }
        param_var_name = f'path_part_{path_part_index}'
        params[param_var_name] = cypher_escape(path_part)
        rel = f'-[{{name: ${param_var_name}}}]->(:Tree)'
        # -[name: "Windows"]->(:Tree) ......
        tree_match_partial.append(rel)

Get the children

This is trivial, once we have both Trees to be compared:

    MATCH (t:Tree {sha1sum: $sha1})-[r:HAS_CHILD_TREE|HAS_CHILD_BLOB]->(c)
    RETURN r.name as filename, c.sha1sum as sha1sum

Performance issues.

If we sum up the algorithm, it takes a lot of queries to achieve a full diff of this tree structure:

1 query to get the modified Trees
n queries to get the related Trees
n queries to get their respective children

Unfortunately this obliterates the performance of my git diff command.

The goal would be to do the full diff in a single Cypher query.

I would like your suggestions, both on the data model and the Cypher language to improve this situtation.

I'm ccing @lju , since her talk: "How to keep track of change - versioning approaches in Neo4j" is clearly on the topic here :)

  • neo4j version: 4.2.4, Ubuntu 20.04, Docker
  • Official Python3 driver

Thank you very much for your help ! :slightly_smiling_face:

Hi @Wenzel

Completely missed this, so sorry!

Will have a think about this - do you have a repo with the data you loaded into the graph?