Hi,
I'm interested by using Neo4j as a version control database for a large hierarchical tree (which is a filesystem like a source code tree, or bigger).
My pet project is neogit
, an implementation of a git client, backed by Neo4J.
Git uses a Merkle-Tree like structure to store its data
I have implemented the same model:
And imported my data (example on neo4j source code repository)
And I made a little tet, modifying a specific file and recapturing the while tree.
As you can see in the following screenshot, PublicApiDoclet.java
has been modified, and a new set of Trees has been created for the new commit, but they have relationship with the old Trees as well, since the rest of the files are unchanged.
Colors:
Branch
Grey: Commit
Tree
Blob
the commit -[:OWNS_FILESYSTEM]-(root_tree:Tree)
and the each Tree -[:HAS_CHILD_TREE | :HAS_CHILD_BLOB]->(:Tree | :Blob)
with each relationship having a name
property corresponding to the child name (file or directory)
The main issue I'm facing now is how to query the difference between 2 commits, and highlight
the new state of files:
- created
- deleted
- modified
Diffing Cypher implementation
My current algorithm to query the differences is the following:
1 Get the list of modified Tree
in the new commit
2 For each modified Tree
- get the full path
- check if there is an existing
Tree
at this path in the old commit - compare the old
Tree
and the newTree
respective children for differences
In the following query example, I'm trying to find the modification between 2 releases
of Windows 10.
1 win10-20H1-xxxx
2 win10-20h2-xxxx
Get the list of modified Trees
Slow: Checking all the Trees originating from the new commit, but not having the old commit as ancestor.
MATCH (new_os:OS {name: "win10-20H2-2009.19042.631"})-[:OWNS_FILESYSTEM]->(new_root:Tree)
WITH new_root
MATCH (old_os {name: "win10-20H1-2004.19041.264"})-[:OWNS_FILESYSTEM]->(old_root:Tree)
WITH old_root,new_root
MATCH (new_root)-[:HAS_CHILD_TREE*]->(new_child_t:Tree)
WHERE NOT (old_root)-[*]->(new_child_t)
RETURN count(new_child_t)
This method is unbearably slow, so I reworked the query to filter using the calculated sha1sum.
Fast : sha1sums filter
MATCH (new_os:OS {name: "win10-20H2-2009.19042.631"})-[:OWNS_FILESYSTEM]->(new_root:Tree)
WITH new_root
MATCH (old_os {name: "win10-20H1-2004.19041.264"})-[:OWNS_FILESYSTEM]->(old_root:Tree)
WITH old_root,new_root
// collect old SHA1 trees
MATCH (old_root)-[:HAS_CHILD_TREE*]->(old_child_tree:Tree)
WITH collect(old_child_tree.sha1sum) as old_sha1_list
// filter new trees
MATCH new_paths = (new_root)-[:HAS_CHILD_TREE*]->(new_child_tree:Tree)
WHERE NOT new_child_tree.sha1sum IN old_sha1_list
RETURN new_child_tree.sha1sum, new_paths
Query for an equivalent Tree at the old_commit
The goal now is to get the filesystem path from each new_paths
collected
before, and build a new query that would retrieve the old Tree, if it exists.
This has to be done by a script outside of cypher as there is not way to build a variable-length pattern dynamically (AFAIK)
tree_path = Path("/Windows/System32/cmd.exe")
tree_match_partial = ["MATCH path = (old_root_fs:Tree {sha1sum: $old_root_sha1})"]
for path_part_index, path_part in enumerate(tree_path.parts[1:]):
# {
# 'path_part_0': 'Windows',
# 'path_part_1': 'System32',
# }
param_var_name = f'path_part_{path_part_index}'
params[param_var_name] = cypher_escape(path_part)
rel = f'-[{{name: ${param_var_name}}}]->(:Tree)'
# -[name: "Windows"]->(:Tree) ......
tree_match_partial.append(rel)
Get the children
This is trivial, once we have both Trees to be compared:
MATCH (t:Tree {sha1sum: $sha1})-[r:HAS_CHILD_TREE|HAS_CHILD_BLOB]->(c)
RETURN r.name as filename, c.sha1sum as sha1sum
Performance issues.
If we sum up the algorithm, it takes a lot of queries to achieve a full diff of this tree structure:
1 query to get the modified Trees
n
queries to get the related Trees
n
queries to get their respective children
Unfortunately this obliterates the performance of my git diff
command.
The goal would be to do the full diff in a single Cypher query.
I would like your suggestions, both on the data model and the Cypher language to improve this situtation.
I'm ccing @lju , since her talk: "How to keep track of change - versioning approaches in Neo4j" is clearly on the topic here :)
- neo4j version:
4.2.4
, Ubuntu 20.04, Docker - Official Python3 driver
Thank you very much for your help !