Hi everyone,
TL;DR: A newbie loads a large graph, is disappointed with traversal performance, and seeks advice.
I work on the KCIDB project, a part of the KernelCI project, sponsored by The Linux Foundation. KCIDB aims to aggregate the results coming from all CI systems testing the Linux kernel (such as Red Hat CKI, Intel 0day, Google Syzbot, Linaro Tuxsuite, KernelCI native testing, and others), unify reporting, and provide long-term analysis. So far we store test results in Google BigQuery (a prototype dashboard is available) and generate basic result notifications based on various criteria.
One of our next steps is implementing known-issue triaging: we keep a database of registered known issues, look for them in incoming test results, and amend test outcomes based on what is found. We have to cope with a constant stream of incoming test failures, because it's infeasible to test every change on every possible bit of hardware and fix every issue before merging.
To intelligently pick which issues should be triaged for which revision, and which issues were found previously, we have to reason about revision connectivity (the Git commit graph, plus patches attached to it) and answer questions such as: "is this revision based on another revision?", "is this revision between these two revisions?", "what are the last 50 revisions before this one?", "what are the 10 releases (tagged commits) before this one?", and so on, joining the results with other data, such as triaging results and known issues.
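To make the shape of these concrete, here is a rough, untested sketch of the "last 50 revisions" question in Cypher (with $hash as a query parameter, and approximating "last 50" as "ancestors within 50 hops"):

MATCH (s:Commit {hash: $hash})
MATCH (s)-[:PARENT*1..50]->(a:Commit)
RETURN DISTINCT a.hash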
I loaded the minimum information (commit hash, parent commit hash, author name and email, and commit subject) for the complete history of the Linux mainline branch into Neo4j Desktop 1.4.15 on my laptop, resulting in a graph with 1,088,612 nodes and 1,176,193 relationships, and created an index on the commit hash.
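For reference, the index was created with something along these lines (assuming the Neo4j 4.x syntax bundled with Desktop):

CREATE INDEX commit_hash IF NOT EXISTS
FOR (c:Commit) ON (c.hash)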
I then tried a few of the queries we might use, and hit a performance wall. A naive approach to checking whether one commit is reachable from another:
MATCH (s:Commit {hash:'3123109284176b1532874591f7c81f3837bbdc17'})
MATCH (e:Commit {hash:'1da177e4c3f41524e886b7f1b8a0c1fc7321cac2'})
WHERE exists((s)-[:PARENT*0..]->(e))
RETURN true
using the start and end commits in the history, completes in about 38ms, great!
But the same query for two commits just 34 hops apart doesn't complete within 20 minutes:
MATCH (s:Commit {hash:'3123109284176b1532874591f7c81f3837bbdc17'})
MATCH (e:Commit {hash:'d0d96121d03d6d9cf608d948247a9f24f5a02da9'})
WHERE exists((s)-[:PARENT*0..]->(e))
RETURN true
It gets better with APOC (completes in 356ms):
MATCH (s:Commit {hash:'3123109284176b1532874591f7c81f3837bbdc17'})
MATCH (e:Commit {hash:'d0d96121d03d6d9cf608d948247a9f24f5a02da9'})
CALL apoc.path.expandConfig(s, {
relationshipFilter: 'PARENT>',
terminatorNodes: e,
limit: 1
})
YIELD path
RETURN length(path)
But go a bit further, to a pair of commits whose connecting subgraph has 85 nodes (a single path through it would be shorter still), and even with a level limit:
MATCH (s:Commit {hash:'3123109284176b1532874591f7c81f3837bbdc17'})
MATCH (e:Commit {hash:'68e77ffbfd06ae3ef8f2abf1c3b971383c866983'})
CALL apoc.path.expandConfig(s, {
relationshipFilter: 'PARENT>',
terminatorNodes: e,
maxLevel: 100,
limit: 1
})
YIELD path
RETURN length(path)
it just runs out of memory after a few minutes. Other similar queries wouldn't return within 10 minutes, after which I'd terminate them.
Needless to say, the last query would take git about half a second, without an explicit level limit, starting the process from scratch, and finding *all* the paths, not just one.
Am I doing something wrong, or am I just using the wrong tool for the job?
Any advice on how (and how much) I could improve this with Neo4j, or perhaps a suggestion of another graph database?
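For what it's worth, one thing I haven't benchmarked yet is Cypher's built-in shortestPath with both endpoints bound (which, as I understand it, does a bidirectional search); a sketch of what I'd try next:

MATCH (s:Commit {hash:'3123109284176b1532874591f7c81f3837bbdc17'})
MATCH (e:Commit {hash:'68e77ffbfd06ae3ef8f2abf1c3b971383c866983'})
MATCH p = shortestPath((s)-[:PARENT*]->(e))
RETURN length(p)

If that's the recommended way to express reachability here, I'd be glad to hear it.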
Thank you for any responses!
P.S. The sample I tried this on is just one branch of one repo. We need to store and query a whole bunch of repos, some with multiple branches. Those would reuse the bulk of the history, but there would be considerable data churn, so we could probably end up with 3x the graph size fairly quickly.