Querying a subgraph of my DB in Cypher

Hello everyone,

I’m looking for guidance on how to efficiently write Cypher queries for a subgraph within my Neo4j database.

Specifically, my use case is similar to the one described here (apologies for the duplicate, but the original post is 6 years old, so I’d like to know if there are updated approaches :sweat_smile:).

Context:

I have a "3D" graph where one of the axes represents time, tracking changes over time. The graph’s latest state can be queried by anchoring to a :TIME:LAST node, which acts as a pivot for all nodes and relationships in the “current” state of the graph.

For example:

MATCH (t:TIME:LAST)<-[:AT]-(n)
OPTIONAL MATCH (n)-[r]->(m)-[:AT]->(t)
RETURN n, m, r

This returns all nodes and relationships in the 2D "latest" slice of my graph.

I also have several complex 2D queries that operate on the graph, and I’d like to run them on the latest slice while ignoring the temporal aspect whenever it’s unnecessary.

Constraints:

  1. I cannot store the latest subgraph in a separate database, as it changes frequently (multiple times per day).
  2. In the future, I’d like to be able to query any 2D slice (not just the latest) of the temporal graph.
  3. I query through Neo4j's Go Driver.

Is there an optimal solution for running queries on such subgraphs? I’m particularly curious if there are recent updates or best practices for this scenario.

Thanks in advance!

What you are proposing makes sense. You need to find the subgraph whose root node is the most recent, then execute any query on this subgraph using the found root node as your starting place. What issues are you encountering?

The challenge is figuring out a way to dynamically run any query on a specific subgraph tied to a given timestamp. Ideally:

Input: A 2D query + a timestamp
Output: The query result, but limited to the subgraph for that timestamp

Neo4j doesn’t make it easy to isolate queries to a “virtual subgraph” based on another query's constraints. If there’s no clean way to do this, I’ll probably just rewrite my queries to work with the temporal structure.

Still, it’d be awesome to have a way to define a virtual subgraph and run queries within it without manually updating every query.
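
For the timestamp half of that input, a minimal sketch is to swap the :LAST label for a parameterized lookup in the latest-slice query from the first post. This assumes each TIME node carries a `timestamp` property and that the caller passes a `$ts` parameter. Both names are hypothetical and not taken from the actual schema.

```cypher
// Assumption: `timestamp` is a hypothetical property on TIME nodes,
// and $ts is a parameter supplied by the caller (for example through the driver).
MATCH (t:TIME {timestamp: $ts})<-[:AT]-(n)
// Keep only relationships whose target node belongs to the same slice.
OPTIONAL MATCH (n)-[r]->(m)-[:AT]->(t)
RETURN n, m, r
```

Only the anchoring line differs from the :TIME:LAST version, so the slice selection can stay a parameter rather than a separate query per timestamp. It does not address the harder part, which is scoping arbitrary existing queries.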

Can you provide some test data, or a diagram of your data and key properties?

Hi, thanks for the attention you are giving to our problem :slight_smile: I am a coworker of Julien, and I created a diagram of one of the situations we would like to solve.

We basically need our old queries (working on a "2D" graph) to apply on our new "3D" graph without worrying about them "leaking" into past nodes.

Let us know if you need any additional data.

I am concerned about the two "AT" relationships from the File to each of the TIME nodes. You said the file did not change, but are you going to add a relationship to each new TIME node every time a new subgraph is versioned? Are you going to have an AT relationship from each node of a subgraph pointing back to the TIME node? This is going to be a lot of unnecessary maintenance.

Are you versioning the REPOs? If so, you can make the REPO node the root of the subgraph. All current versions of elements in the REPO when you version it point back to the REPO node. This means eliminating the two AT relationships between the FILE and TIME nodes. Instead, you get all the current versions of elements in a given REPO snapshot from their relationship with the REPO node of interest.

With this, you could do something as simple as the following. It would give you all nodes contained in the REPO as rows.

match(n:TIME:LAST)
match(r:REPO)-[:AT]->(n)
match(r)-[:CONTAINS]->(c)
return c as content

The above will give the contents mixed together, not sorted by type. Assuming you also have other types A and B stored in the repo and you want the contents of all of them grouped by label, you can use pattern comprehensions:

match(n:TIME:LAST)
match(r:REPO)-[:AT]->(n)
return {
    files: [(r)-[:CONTAINS]->(f:FILE) | f],
    a_nodes: [(r)-[:CONTAINS]->(a:A) | a],
    b_nodes: [(r)-[:CONTAINS]->(b:B) | b]
} as current_snapshot

You can return just specific properties from the nodes instead of the entire nodes.

match(n:TIME:LAST)
match(r:REPO)-[:AT]->(n)
return {
    files: [(r)-[:CONTAINS]->(f:FILE) | {file_prop1: f.file_prop1, file_prop2: f.file_prop2} ],
    a_nodes: [(r)-[:CONTAINS]->(a:A) | a{.a_prop1, .a_prop2, .a_prop3}],
    b_nodes: [(r)-[:CONTAINS]->(b:B) | b{.*}]
} as current_snapshot

NOTE: I used different forms of map projection to create the maps of properties for the different nodes to show you the scope of the operation.

Hello again!

In the case of the smaller example Pierre gave you, sure, we could version the repo and get what we want from that. But in our actual graph, the situation is more complex. We have 15+ node types, and each type can be versioned independently. This is why we decided to introduce temporal nodes as anchors. While it does increase the edge count, it allows us to query a 2D slice of the graph efficiently with a relatively small query.

For more context, our graph initially didn’t include a temporal dimension. As a result, we’ve built and relied on a lot of 2D queries that now cannot directly apply to the updated graph with the temporal structure. The temporal nodes essentially act as a snapshot mechanism, linking to all nodes and relationships present at a given time. It might not be the best design, and we're open to rethinking it if needed, but we believe it can work if we solve the main issue we’re facing.

Our true issue is figuring out the proper way to run Cypher queries on a subset of nodes and edges within the graph, essentially limiting the scope of the query to only a specific subgraph (e.g., the nodes and edges linked to a particular :TIME:LAST node). We’re not sure if there’s a best practice or an efficient way to achieve this, and that’s where we need guidance.
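
To make the "leaking" concern concrete, here is a rough sketch of what scoping one of our existing 2D patterns by hand looks like. It borrows the REPO, FILE and CONTAINS names from the example earlier in this thread purely for illustration, and assumes every node in the slice has an AT relationship to the anchor. It is not our real schema.

```cypher
// Original 2D query (illustrative only):
//   MATCH (r:REPO)-[:CONTAINS]->(f:FILE) RETURN r, f
//
// Slice-scoped version: every node the pattern binds must also be
// attached to the chosen TIME anchor, otherwise past versions leak in.
MATCH (t:TIME:LAST)
MATCH (r:REPO)-[:CONTAINS]->(f:FILE)
WHERE (r)-[:AT]->(t) AND (f)-[:AT]->(t)
RETURN r, f
```

Every bound node needs its own AT predicate, which is exactly the per-query rewriting we would like to avoid across our 15+ node types.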

Thank you for the help :slight_smile:

Assuming a TIME node is the root node of a 2D subgraph, you need to find the specific TIME node that matches the given timestamp, or the one labeled LAST if you want the current snapshot. Once you have the TIME node, you find all of its nodes by following the AT relationship to each. How efficiently you can query anything deeper than the AT relationship depends on the structure of the rest of the subgraph.

match(n:TIME:LAST)
match(m)-[:AT]->(n)
// more complex Cypher to traverse the subgraph hanging off each 'n' node goes here
return m // replace with the desired result, structured appropriately
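
For an arbitrary point in time rather than the latest snapshot, one possible way to find the TIME node is sketched below. It assumes the TIME nodes have an orderable `timestamp` property and that `$ts` is passed in as a parameter. Both names are my assumptions, so adjust them to whatever your model actually uses.

```cypher
// Assumption: `timestamp` is a hypothetical, orderable property on TIME nodes.
// Pick the most recent TIME node at or before the requested timestamp.
MATCH (t:TIME)
WHERE t.timestamp <= $ts
WITH t
ORDER BY t.timestamp DESC
LIMIT 1
// Collect every node attached to that snapshot.
MATCH (m)-[:AT]->(t)
RETURN t, collect(m) AS slice_nodes
```

If you go this way, an index on that property would probably help the lookup, but I have not measured it against your data.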

I would need to know more about the rest of the structure to help optimize a query.