Performance Trade-Off of WHERE vs. Refactoring in dense graphs

I am in what I would think is a common situation: I want to model a network of nodes connected by weighted, timestamped relationships.

The graph will have fewer than a million nodes but a huge number of edges. Density may be high, especially since there may be parallel edges between the same pair of nodes with different timestamps.

Primary use case: Query a node and its relationships, subset edges via weight or timestamp. This is easily done with a WHERE clause.
Worst case scenario: A set of nodes and some of their relationships are queried to "subset" a large portion of the network in node and time dimension.
Specifically, my assumption is that, given the small number of nodes, subsetting edges by time or weight yields a subgraph that fits into RAM, which would be very beneficial. The full graph, of course, will be too large for that.
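
As a minimal sketch of the primary use case, assuming a hypothetical `:Entity` label with an `id` property (neither is fixed yet) and query parameters `$id`, `$from`, `$to` and `$minWeight`, the WHERE-based subset could look roughly like this:

```cypher
// Hypothetical names: :Entity, id, and all $parameters are placeholders.
// Enter via one node, then keep only ties inside a time window
// and above a weight threshold.
MATCH (a:Entity {id: $id})-[r:tie]->(b:Entity)
WHERE r.time >= $from
  AND r.time < $to
  AND r.weight >= $minWeight
RETURN b.id AS target, r.time AS time, r.weight AS weight
ORDER BY r.time;
```

Without further indexing, I would expect this to expand every `tie` relationship attached to `a` and check the predicate on each one, which is exactly the cost I am unsure about.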

Before populating the network, I need a general understanding of the performance trade-off. All operations are costly for me in terms of time and money, so I hope you can give me some pointers.

My research suggests there are two designs:

  • (a)-[r:tie {time: x, weight: y}]->(b)

  • (a)-[r1]->(c:tie {time: x, weight: y})-[r2]->(b)

In other words: the tie modelled as a relationship, or the tie modelled as an intermediate node.
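
To make the two designs concrete, here is a rough sketch of how one tie might be written in each. The `:Entity` label, the `id` property, the `:OUT` and `:IN` relationship types, and the parameters `$source`, `$target`, `$time` and `$weight` are all placeholders I made up for illustration.

```cypher
// Design 1: the tie is a relationship that carries time and weight.
MERGE (a:Entity {id: $source})
MERGE (b:Entity {id: $target})
CREATE (a)-[:tie {time: $time, weight: $weight}]->(b);

// Design 2: the tie is an intermediate node; the two connecting
// relationships (:OUT and :IN are hypothetical names) have no properties.
MERGE (a:Entity {id: $source})
MERGE (b:Entity {id: $target})
CREATE (a)-[:OUT]->(:tie {time: $time, weight: $weight})-[:IN]->(b);
```

Note that the second design stores one extra node and two relationships per tie instead of a single relationship.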

If I follow the philosophy behind Neo4j to the letter, I would have to choose the first option, since the starting point of any query is a node or a set of nodes, and relationships are NOT entities in their own right (cf. e-mail example).
As such, a WHERE on "r" allows for an elegant solution that preserves the logic of the underlying situation.

It is my intuition, however, that the first option will not be performant, and that I should choose the second option now rather than refactor later (again, everything is costly).
The reason is that, with many edges and few nodes, my graph probably does not fit the performance model of Neo4j well. It would probably be beneficial to have an index on time, because even though I technically enter via a node, the "width" of the computation comes from the dense edges. Queries could then be rewritten to subset the ties first via the index and get away with much cheaper computations (given the few actual nodes).
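
A rough sketch of what I have in mind for the second design follows. The index name `tie_time`, the `:Entity` label, the `:OUT` and `:IN` relationship types, and the parameters `$from`, `$to` and `$ids` are assumptions for illustration only.

```cypher
// Index the timestamp of the reified tie nodes.
CREATE INDEX tie_time IF NOT EXISTS
FOR (t:tie) ON (t.time);

// Subset ties by time first, then expand to the (few) endpoint nodes.
MATCH (t:tie)
WHERE t.time >= $from AND t.time < $to
MATCH (a:Entity)-[:OUT]->(t)-[:IN]->(b:Entity)
WHERE a.id IN $ids
RETURN a.id AS source, b.id AS target, t.time AS time, t.weight AS weight;

// For comparison: as far as I can tell, Neo4j 4.3 and later also accept
// an index directly on a relationship property, which would apply to
// the first design (the name tie_rel_time is hypothetical):
// CREATE INDEX tie_rel_time IF NOT EXISTS
// FOR ()-[r:tie]-() ON (r.time);
```

Whether the planner actually starts from the index rather than from the entry node is something I cannot predict, so this is only meant to show the shape of the rewritten query.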

I'd humbly ask whether you can confirm or deny that intuition, or if you would prefer to stick to relationship properties given that they reflect the underlying reality?
I realize the answer is likely "it depends", but I do not have the resources for ex-ante benchmarking. This is why I ask for your experience, as this must be a fairly common situation.

I'd like to be careful: maybe WHERE is not costly at all in these situations? Or perhaps, given the number of edges, transforming edges into nodes leads to a huge network with many more nodes that is still dense with edges, which then carry no properties. Would that make it difficult to use the "weight" property efficiently in computations?
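
Both doubts could in principle be checked on a small sample without a full benchmark. The sketch below assumes the same hypothetical `:Entity` label, `id` property, `:OUT` and `:IN` relationship types and `$` parameters as placeholders.

```cypher
// PROFILE executes the query and reports rows and db hits per operator,
// which shows how much work the WHERE on the relationship really causes.
PROFILE
MATCH (a:Entity {id: $id})-[r:tie]->(b:Entity)
WHERE r.weight >= $minWeight
RETURN b.id AS target, r.time AS time, r.weight AS weight;

// Design 2: the weight lives on the tie node, so a weighted edge list
// has to be derived by collapsing the parallel ties per node pair.
MATCH (a:Entity)-[:OUT]->(t:tie)-[:IN]->(b:Entity)
WHERE t.time >= $from AND t.time < $to
RETURN a.id AS source, b.id AS target,
       sum(t.weight) AS totalWeight, count(t) AS ties;
```

Summing the weights is only one possible aggregation; which one is appropriate depends on what the weight means in the underlying data.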

Thank you for your insights, or any experience you have in these situations.
