Modeling and querying package dependency trees

cjmartian
Node

Dear Community,

We are Connor and Daniel, software engineers at Anaconda, and we are currently looking at neo4j to represent our package metadata and answer questions like:

  • What packages are available for python X on platform Y but not on platform Z?
  • Which missing dependencies do I need to build, and in which order, if I want to build package X (currently available on platform Y) on platform Z?
  • What are all the downstream packages of package X (e.g. to run their tests when package X is updated)?

Background: Package managers like conda, pip, apt, and yum solve dependency trees via SAT solvers to install a package. Loading dependency trees into a graph database like neo4j feels like the most natural way to model the relations and to query, visualize, and compare graphs and sub-graphs, and several articles inspired this idea.

But can this be applied to conda packages, where multiple versions of each package are available at the same time and dependencies carry specific version-range constraints?

Each package contains an index.json file that describes its dependencies:

...
"depends": [
  "bleach",
  "bokeh >=2.4.0,<2.5.0",
  "markdown",
  "param >=1.12.0",
  "pyct >=0.4.4",
  "python >=3.8,<3.9.0a0",
  "pyviz_comms >=0.7.4",
  "requests",
  "tqdm >=4.48.0"
],
...
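
Before ingestion, each "depends" entry has to be split into a package name and its version constraints. A minimal Python sketch of that step, under stated assumptions: `parse_spec` is a hypothetical helper (not a conda API), and it only handles the simple "name constraint,constraint" form shown above. Anything after the version field (such as a build string) is ignored, and `|` alternatives are not handled.

```python
import re

# One constraint: an optional comparison operator followed by a version.
_CONSTRAINT = re.compile(r"(>=|<=|==|!=|>|<|=)?\s*(\S+)")


def parse_spec(spec):
    """Split e.g. 'bokeh >=2.4.0,<2.5.0' into ('bokeh', [('>=', '2.4.0'), ('<', '2.5.0')]).

    Hypothetical helper for illustration only.
    """
    fields = spec.split()
    name = fields[0]
    # Second field holds the comma-separated constraints, if present.
    version_part = fields[1] if len(fields) > 1 else ""
    constraints = []
    for part in version_part.split(","):
        if not part:
            continue
        op, version = _CONSTRAINT.fullmatch(part).groups()
        # A bare version such as '1.19.2' is treated as an exact match.
        constraints.append((op or "=", version))
    return name, constraints


depends = ["bleach", "bokeh >=2.4.0,<2.5.0", "python >=3.8,<3.9.0a0"]
for entry in depends:
    print(parse_spec(entry))
```

The resulting (operator, version) pairs could then be stored as properties on the dependency relation instead of the raw string.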

Model idea (figure: Untitled graph.png)

Specifics:

  • The blue nodes are virtual packages: each represents a library for which packages exist, but is not itself a real package (e.g. python, numpy).
  • The orange nodes are real packages that can be installed (e.g. `conda install numpy`).
  • Each package can have 0-n dependencies of type "run":
    • Dependencies are not drawn directly from real package to real package, because a dependency is a constraint on a virtual package, and multiple package versions can fulfill it.
    • Instead, each dependency is drawn as a relation to a version-less blue virtual package node, and the relation's properties state which versions would satisfy the dependency.
  • Finding all dependencies of a package, including the indirect transitive ones, requires matching from the real package through the blue virtual package to the next real package. The idea:
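// Sketch: real packages that satisfy the run dependencies of numpy 1.19.2
MATCH (x:real {name: "numpy", version: "1.19.2"})-[r:DEPENDS_ON {type: "run"}]->
      (y:virtual)<-[s:PROVIDES]-
      (z:real)
// Placeholder condition, see the open question on comparing versions
WHERE s.version IN r.constraints
RETURN z.name, max(z.version), max(z.build)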
  • Open model questions:
    • How can the MATCH pattern above be applied recursively to find all transitive dependencies of each dependency?
    • How can the "s.version IN r.constraints" condition be applied? We could normalize the versions to comparable integers before data ingestion and split the constraints into multiple versions with their comparison operators, which results in three WHERE conditions:
      • >=, <, !=
    • How should duplicates within the transitive dependency relations be handled, where the same package appears with slightly different constraints?
    • How can the max version of packages be preferred (max(z.version))?
    • How can all of this be done efficiently?
  • The ultimate question: Can the problem of finding all the dependencies of a package, including their order, be solved with neo4j and Cypher queries (maybe together with a path search algorithm like A*)?
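
To make the open questions concrete, here is a plain-Python sketch (not Cypher) of the logic a recursive query would need: compare normalized versions, pick the highest matching version, and follow dependencies depth-first so that the post-order yields a build order. The toy `INDEX`, its package names, and all function names are made up for illustration. Versions are assumed to be purely numeric (no pre-release tags such as 3.9.0a0), and the greedy choice never backtracks, so unlike a SAT solver it only reports a conflict when two paths disagree.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt,
       "<": operator.lt, "=": operator.eq, "==": operator.eq,
       "!=": operator.ne}

# Toy index (assumption): name -> {version: [(dependency name, constraints)]}
INDEX = {
    "app": {"1.0": [("lib", [(">=", "1.0"), ("<", "2.0")]), ("core", [])]},
    "lib": {
        "1.0": [("core", [(">=", "1.1")])],
        "1.5": [("core", [(">=", "1.1")])],
        "2.0": [("core", [(">=", "2.0")])],
    },
    "core": {"1.0": [], "1.2": [], "2.0": []},
}


def version_key(version):
    """Normalize '1.19.2' to the comparable tuple (1, 19, 2). Numeric parts only."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, constraints):
    """True if the version meets every (operator, version) constraint."""
    key = version_key(version)
    return all(OPS[op](key, version_key(bound)) for op, bound in constraints)


def resolve(name, constraints, chosen, order):
    """Greedy depth-first resolution: highest matching version, then recurse."""
    if name in chosen:
        # Package reached again via another path: only check compatibility.
        if not satisfies(chosen[name], constraints):
            raise ValueError(f"conflict on {name}: {chosen[name]} vs {constraints}")
        return
    candidates = [v for v in INDEX[name] if satisfies(v, constraints)]
    if not candidates:
        raise ValueError(f"no version of {name} matches {constraints}")
    best = max(candidates, key=version_key)
    chosen[name] = best
    for dep_name, dep_constraints in INDEX[name][best]:
        resolve(dep_name, dep_constraints, chosen, order)
    order.append((name, best))  # post-order: dependencies are listed first


def build_order(name):
    """Return (package, version) pairs with every dependency before its dependents."""
    order = []
    resolve(name, [], {}, order)
    return order


print(build_order("app"))
# [('core', '2.0'), ('lib', '1.5'), ('app', '1.0')]
```

In this sketch, duplicates with differing constraints are handled by keeping the first choice and failing on a mismatch. A real resolver would instead backtrack or intersect the constraints, which is the part that SAT solvers address.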

Any feedback, hints, or model ideas are appreciated.

Thanks

2 REPLIES

cjmartian
Node

CC @dbast 

Thanks!