Get closest parent meeting criteria

Hi all,
I have a single root tree of a view:

root <-[:PARENT]-child1
child1<-[:PARENT]-child2
child1<-[:PARENT]-child3
child3<-[:PARENT]-child4
etc

Some nodes have a user property. I need to match nodes that have no user property with one of its closest parent who this property has.
My query is:

MATCH (c:Node) WHERE c.user IS NULL
CALL {
    WITH c
    MATCH path=(p:Test)<-[:PARENT*]-(c) WHERE p.user IS NOT NULL
    RETURN nodes(path) as n order by size(n)
    limit 1
 }
RETURN c, n

For small graphs it works ok. But for the graph of several millions nodes and leaves distance from the root reaching 1000+ it takes ages.
Is there a way to optimize this query?

You can try:
STEP 1: - add user property index
CREATE INDEX node_user_index IF NOT EXISTS FOR (n:Node) ON (n.user);

STEP 2: apoc procedure to find closet ancestor

MATCH (c:Node)
WHERE c.user IS NULL

CALL apoc.path.subgraphNodes(
  c,
  {
    relationshipFilter: "<PARENT", // direction to check the path
    maxLevel: 1000, // limit to set , to prevent unlimited routes
    bfs: true,  // searching type
    filterStartNode: true, // id true the node can be ancestor of itself
    labelFilter: "+Node",  //Check only this labels
    where: "_.user IS NOT NULL"  //check only if property exist
  }
) YIELD node
WITH c, node ORDER BY length((node)-[:PARENT*]->(c)) ASC LIMIT 1
RETURN c, node AS closestAncestor

Thank you for the response.
Wouldn't the query with APOC call be equivalent this:

MATCH (c:Node) WHERE c.user IS NULL
MATCH (node:Node)<-[:PARENT*]-(c) WHERE node.user IS NOT NULL
WITH c, node ORDER BY length((node)-[:PARENT*]->(c)) ASC LIMIT 1
RETURN c, node AS closestAncestor

?
What benefit does APOC procedure give?

Update: meanwhile I tried to run this query on a small data set. It raises the error:

Type mismatch: expected Path but was List (line 15, column 30 (offset: 486))
"WITH c, node ORDER BY length((node)<-[:PARENT*]-(c)) ASC LIMIT 1"

I'm not sure how the WITH clause is handled. According to the error is seems the path provided as length() argument is calculated for the list of nodes (all that returned by previous matches) and somehow produces not one path but list of them at once instead of streaming separate paths as records, that can be sorted.

Another question is again about APOC procedure. I've tried also apoc.path.expandConfig. It provides pretty much same result and documentation describes pretty much same functionality. So, what's the difference?

The apoc path procedures can be more efficient, as the run on the server and use the Java API to traverse the graph. In doing so, they can apply the filter incrementally as they traverse.

The pattern within the length function is producing multiple paths between the two nodes, thus the error.

I agree the family of apoc path procedures do look confusing. The expand and expandConfig procedures return paths, whereas the subgraphNodes returns nodes. You should try the expand versions if you want the to order by path length or determine the node’s depth, as you get the path. In the above approach, the path is getting recreated.

Do you really need to search without a length constraint? Are your paths long? Is the neighboring nodes that meet your requirements within a few hops?

So, would that mean without subquery it's not doable?

Yes, the production network has over 3M nodes now and continuously growing, so I can't set limit anywhere, may be 1B. And the depth is unpredictable too. Some branches are purely binary (each node has no more that 2 children) some are rather linear (each node has one child). In theory I can assume that the path from the leaf to the root in worst case will be no longer than 500M.

Are you remediating your database because you didn’t set the user? Wouldn’t it be easier to set the user at time if creation?

I don’t think this is going to be efficient with a sizable database if the number of paths are large and the path lengths are great. The problem with these algorithms is that they are going to find all compliant paths, then you are going to find at least one nearest node with a user. As such, there is a lot of searching to paths that will not be utilized. If this were me, I would write a custom procedure to do this. The procedure would do a breath first search and stop the search once a node is find.

I settled on this variant:

MATCH (c:Node) WHERE c.username IS NULL
WITH c, COLLECT { MATCH (p:Node) WHERE p.username IS NOT NULL RETURN p } AS p
CALL (c, p) {
  CALL apoc.path.expandConfig(
    c, {
      minLevel: 1,
      maxLevel: 3000000,
      relationshipFilter: "PARENT>",
      labelFilter: "Node",
      terminatorNodes: p
    }
  ) YIELD path
  WITH last(nodes(path)) AS parent
  RETURN parent.username AS username
}
SET c.username = username;

Apparently this way apoc.path.expandConfig stops exactly at the first parent that meet criteria. Because the path child-[:PARENT*]->parent-[:PARENT*]->root has no branches the first parent that meets criteria would be the right one.

I'm actually not sure about where apoc.path.expandConfig stops. That came to me as a conclusion of the procedure definition, terminatorNodes description and empirical observation.

Majority of nodes are added without username, so I have to copy it from the ancestor. Because adding nodes happens separate of usernames definition I have no way to know whether added node has a username or not. So the whole process goes on like that:

  1. Add new nodes
  2. Go through known usernames and assign to corresponding nodes
  3. Fill nodes with empty usernames with ones from the ansestor
1 Like