Deleting subgraphs based on timestamps - similar to TTL


(Benjamin Squire) #1

I have a graph with static timestamp (datetime) properties on nodes labelled User. These timestamps record the first and last observation of each user at the time the data was loaded, and both are indexed. I am trying to maintain one week's worth of data, which means that each day I want to delete the oldest day of the week before I add the newest day. The User nodes are also connected to third-party ID nodes, which have no timestamps on them.
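
For illustration, the daily cutoff for such a rolling window could be derived as in the sketch below. The function name `deletion_cutoff`, the 7-day default, and the example date 2018-01-08 are assumptions, chosen so that the result lines up with the '2018-01-02' cutoff used in the queries in this thread.

```python
from datetime import date, datetime, timedelta


def deletion_cutoff(newest_day: date, window_days: int = 7) -> datetime:
    """Return midnight at the start of the oldest day that stays in the window.

    Users whose last_obs is strictly before this instant are deletion candidates.
    (Hypothetical helper, not part of the original post.)
    """
    if window_days < 1:
        raise ValueError("window_days must be at least 1")
    oldest_kept_day = newest_day - timedelta(days=window_days - 1)
    return datetime.combine(oldest_kept_day, datetime.min.time())


# Loading 2018-01-08 into a 7-day window keeps 2018-01-02 .. 2018-01-08.
cutoff = deletion_cutoff(date(2018, 1, 8))
print(cutoff.isoformat())  # 2018-01-02T00:00:00
```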

What I am trying to do is delete every user from the first day whose 4-hop subgraph contains no timestamps beyond the first day.

A couple of approaches I am considering:

First approach:
1.) Collect all users who are from the first day.
2.) Using UNWIND, for each user in that collection, run apoc.path.subgraphNodes with maxLevel set to 4 and return all User nodes in the subgraph. If any of them has a last observation after the first day, do not delete any of the nodes. Otherwise, delete (or mark for deletion) all nodes in the subgraph.

Second approach:
1.) Py2Neo? Do everything in Python?
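
To make the per-user check concrete, here is a rough pure-Python sketch that runs on an in-memory adjacency map instead of Neo4j. Everything in it is an assumption made for illustration: the names `subgraph_nodes` and `can_delete`, the dictionary layout, and the toy node names. It only mimics what a bounded subgraph expansion followed by a timestamp check would do.

```python
from collections import deque
from datetime import datetime


def subgraph_nodes(adjacency, start, max_level):
    """Breadth-first expansion: every node within max_level hops of start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_level:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return seen


def can_delete(adjacency, last_obs, user, cutoff, max_level=4):
    """True if no user within max_level hops was observed at or after cutoff.

    Nodes missing from last_obs (the third-party IDs) carry no timestamp
    and are ignored by the check.
    """
    nodes = subgraph_nodes(adjacency, user, max_level)
    return all(last_obs[n] < cutoff for n in nodes if n in last_obs)


# Toy data (hypothetical): u1 and u2 share id1, u3 is on its own with id2.
adjacency = {
    "u1": ["id1"], "id1": ["u1", "u2"], "u2": ["id1"],
    "u3": ["id2"], "id2": ["u3"],
}
last_obs = {
    "u1": datetime(2018, 1, 1),
    "u2": datetime(2018, 1, 3),
    "u3": datetime(2018, 1, 1),
}
cutoff = datetime(2018, 1, 2)

print(can_delete(adjacency, last_obs, "u1", cutoff))  # False: u2 is recent
print(can_delete(adjacency, last_obs, "u3", cutoff))  # True
```

At the data volumes mentioned in this thread, pulling subgraphs into Python one user at a time would probably be far slower than doing the expansion inside the database, so this is mainly useful for reasoning about the logic.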

I am looking for help setting this query up and for suggestions on optimizations. This is a toy example; in production we are aiming to maintain 1-2 years of data, with around 5-15 million users per day to consider for deletion out of 1.8 billion User nodes.

What I have so far:

MATCH (u:User) WHERE datetime(u.last_obs) < datetime('2018-01-02')
WITH u LIMIT 10
WITH collect(u) AS coll
UNWIND coll AS u
CALL apoc.path.subgraphNodes(u, {maxLevel:2, filterStartNode:true, relationshipFilter:'OBSERVED_WITH', labelFilter:'>User'}) YIELD node
RETURN node.last_obs
ORDER BY node.last_obs DESC

This returns the last_obs of every user in each subgraph for the original 10 users (WITH u LIMIT 10). What I want is a per-user CASE WHEN check: for each user, order the subgraphNodes results by node.last_obs DESC with LIMIT 1, and if that latest last_obs is earlier than '2018-01-02', call subgraphNodes again with no label filter and mark/delete all nodes in the subgraph.

  • neo4j version : 3.4.9 community
  • Possibly using py2neo. Otherwise just cypher or Apoc

(Jasper Blues) #2

You can probably set up GraphAware's Neo4j Expire module to do what you want.

You can automatically expire (delete) nodes after a week, after evaluating additional criteria based on a graph traversal.

This module runs continuously as a managed extension, by default, when the CPU is otherwise idle.

(Benjamin Squire) #3

My solution is as follows. I am still working on optimizing it, but this worked. The basic idea is:
1.) In parallel, mark all users who have a subgraph that extends past the first day.
2.) Mark all users who do not.
3.) Delete all users who do not (note this deals with a race condition).
4.) Remove all labels related to this process.

Match (u:User) where u.last_obs < datetime('2018-01-02') return count(u) limit 4;
| count(u) |
| 4124550  |
// Step 1: label as Keep every old user with a recently observed user within 2 hops
CALL apoc.periodic.iterate(
  "MATCH (u:User) WHERE u.last_obs < datetime('2018-01-02') RETURN u",
  "CALL apoc.path.subgraphNodes(u, {maxLevel:2, filterStartNode:true, relationshipFilter:'OBSERVED_WITH', labelFilter:'>User'}) YIELD node WHERE node.last_obs >= datetime('2018-01-02') SET u:Keep",
  {batchSize:100, iterateList:true, parallel:true, retries:3});

// Step 2: label as MarkDel every node in the subgraph of the remaining old users
CALL apoc.periodic.iterate(
  "MATCH (u:User) WHERE u.last_obs < datetime('2018-01-02') AND NOT u:Keep RETURN u",
  "CALL apoc.path.subgraphNodes(u, {maxLevel:2, filterStartNode:true, relationshipFilter:'OBSERVED_WITH'}) YIELD node SET node:MarkDel",
  {batchSize:100, iterateList:true, parallel:true, retries:3});

// Step 3: delete everything labelled MarkDel, together with its relationships
CALL apoc.periodic.iterate(
  "MATCH (d:MarkDel) RETURN d",
  "DETACH DELETE d",
  {batchSize:100, iterateList:true, parallel:true, retries:3});
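
For anyone who wants to reason about the mark-then-delete idea without a database, the steps can be mimicked on an in-memory graph roughly as below. This is only a sketch under stated assumptions: `reach`, `prune`, and the toy node names are hypothetical, Python sets stand in for the Keep and MarkDel labels, and the sketch assumes that a node carrying both marks should survive. As far as I can tell, the posted delete statement does not make that exclusion, so the two may differ on such nodes.

```python
from datetime import datetime


def reach(adjacency, start, max_level):
    """All nodes within max_level hops of start (start included)."""
    seen, frontier = {start}, {start}
    for _ in range(max_level):
        frontier = {n for f in frontier for n in adjacency.get(f, ())} - seen
        seen |= frontier
    return seen


def prune(adjacency, last_obs, cutoff, max_level=2):
    """Return the set of nodes that would be deleted."""
    old_users = {u for u, seen_at in last_obs.items() if seen_at < cutoff}

    # Step 1: "Keep" every old user that has a recently observed user nearby.
    keep = {
        u for u in old_users
        if any(n in last_obs and last_obs[n] >= cutoff
               for n in reach(adjacency, u, max_level))
    }

    # Step 2: "MarkDel" the whole neighbourhood of every remaining old user.
    mark_del = set()
    for u in old_users - keep:
        mark_del |= reach(adjacency, u, max_level)

    # Step 3: delete marked nodes. ASSUMPTION: nodes that are also in keep survive.
    # Step 4: nothing to clean up here, because the marks are local sets.
    return mark_del - keep


# Toy chain (hypothetical): u1 - id1 - u2 - id2 - u3, plus an isolated u4 - id3.
adjacency = {
    "u1": ["id1"], "id1": ["u1", "u2"], "u2": ["id1", "id2"],
    "id2": ["u2", "u3"], "u3": ["id2"],
    "u4": ["id3"], "id3": ["u4"],
}
last_obs = {
    "u1": datetime(2018, 1, 1), "u2": datetime(2018, 1, 1),
    "u3": datetime(2018, 1, 3), "u4": datetime(2018, 1, 1),
}

print(sorted(prune(adjacency, last_obs, datetime(2018, 1, 2))))
# ['id1', 'id3', 'u1', 'u4']
```

In this toy run u2 ends up in both sets: it is marked through u1's neighbourhood, but it is also within 2 hops of the recent user u3. That is the kind of overlap worth checking before running the delete step on real data.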