Resources to understand implications of graph structure on query perf?

Howdy all,

Neo4j noob here. I am considering using Neo4j for a startup I'm working on and am trying to wrap my head around query perf for the graph model we'd have. The model maps easily to a graph data structure, but in order to justify the risk of using a relatively unknown tech, I need to understand the perf implications of the model and query patterns. We here are all familiar and comfortable with RDBMSes, so we can reason about what the perf implications are there, being mostly a factor of how many relationships there are and how many round trips would be needed, etc.

What I am struggling with is understanding what the general perf characteristics are for Neo4j for a given model. For instance, what is the max number of hops that I can query against and still get reasonable perf (our users would be waiting for this query to complete)? How does the number of nodes and relationships impact query perf? Is it reasonable to return nodes an arbitrary number of hops away from a starting node? etc.

Are there any good primers I could read that would help me understand this better? I've read through the official docs, and while helpful from a getting started perspective, they haven't really helped me understand best practices around data models as they relate to query patterns and perf.

Thanks!

Since graph dbs use pointers between nodes and relationships instead of join tables, your performance for queries involving multiple traversals is proportional to the data that you traverse and filter, and not the total data (such as the number of nodes of a particular label or the number of relationships of a particular type).
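To make the "proportional to what you traverse" point concrete, here is a minimal sketch in plain Python rather than Neo4j itself. It assumes a graph stored as an adjacency dictionary and counts how many relationships a 2-hop expansion follows; the node names, `build_graph`, and `neighbors_within` are all hypothetical and only there for illustration.

```python
from collections import deque


def neighbors_within(adjacency, start, max_hops):
    """Breadth-first expansion; returns (reached nodes, relationships followed)."""
    seen = {start}
    frontier = deque([(start, 0)])
    touched = 0
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for other in adjacency.get(node, ()):
            touched += 1  # one pointer hop per relationship followed
            if other not in seen:
                seen.add(other)
                frontier.append((other, depth + 1))
    return seen - {start}, touched


def build_graph(unrelated_nodes):
    # Hypothetical toy data: a small neighbourhood around "alice",
    # plus filler nodes that are never reachable from her.
    adjacency = {
        "alice": ["bob", "carol"],
        "bob": ["dave"],
        "carol": ["dave", "erin"],
    }
    for i in range(unrelated_nodes):
        adjacency[f"filler-{i}"] = [f"filler-{i + 1}"]
    return adjacency


reached_small, touched_small = neighbors_within(build_graph(10), "alice", 2)
reached_large, touched_large = neighbors_within(build_graph(100_000), "alice", 2)

print(sorted(reached_small), touched_small)
print(touched_small == touched_large)
```

The count of relationships followed is the same whether the graph holds ten unrelated nodes or a hundred thousand, which is the property being described. A real database adds the cost of finding the starting node and of reading data from disk or cache, which this sketch ignores.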

Query tuning and performance is then all about ensuring you touch the smallest portion of the graph needed to get the desired results, as well as optimizing queries to minimize cardinality and avoid applying redundant operations multiplicatively.
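As a rough illustration of what "multiplicatively" means here, the following Python sketch imitates how a query produces one row per matched path and then runs later operations once per row. It is a simulation under assumptions, not Cypher: the data, `expand`, and `describe` are hypothetical, and deduplicating the rows stands in for what `DISTINCT` or an aggregation does in a Cypher query.

```python
def count_calls(fn):
    """Wrap fn so that we can see how many times it was invoked."""
    def wrapper(*args):
        wrapper.calls += 1
        return fn(*args)
    wrapper.calls = 0
    return wrapper


# Hypothetical toy data: who I follow, and which artists they like.
follows = {"me": ["ann", "ben", "cat"]}
likes = {
    "ann": ["jazz-trio", "rock-band"],
    "ben": ["jazz-trio"],
    "cat": ["jazz-trio", "rock-band"],
}


def expand(rows, adjacency):
    """One output row per (input row, relationship) pair, like a pattern expansion."""
    return [row + (other,) for row in rows for other in adjacency.get(row[-1], ())]


@count_calls
def describe(artist):
    # Stand-in for any per-row work, such as a further expansion or lookup.
    return artist.upper()


rows = expand(expand([("me",)], follows), likes)

# Naive: the per-row work runs once for every path that reached an artist.
naive_result = [describe(row[-1]) for row in rows]
naive_calls = describe.calls

# Reduced cardinality: collapse to distinct artists first, then do the work.
describe.calls = 0
distinct_artists = sorted({row[-1] for row in rows})
distinct_result = [describe(artist) for artist in distinct_artists]
distinct_calls = describe.calls

print(len(rows), naive_calls, distinct_calls)
```

Five paths reach only two distinct artists, so the per-row work runs five times in the naive version and twice after deduplicating. With real fan-outs in the hundreds per hop, that gap is what profiling tends to reveal.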

This usually consists of:

  1. Ensuring you have an index or unique constraint for starting nodes in your graph so your initial lookups are quick. The schema sections in the docs should have you covered here.

  2. Ensuring you have a model that favors narrow selection of relevant relationships vs more generic relationships and filtering of the node on the other end (or filtering based on properties of the relationship or other node). While often there's no way around this except to expand and filter, if you can use specific relationships that are doing the filtering for you based just on the types, you can often see a considerable performance boost.
    Max De Marzi is one of our specialists for modeling and tuning for performance. He talks a bit about this kind of modeling in this blog entry, but you'll probably want to scan through all of his blog posts to look for modeling topics that match up to your needs.

  3. Understanding how Cypher works (especially when it comes to cardinality in Cypher queries) so you can avoid pitfalls, make modeling decisions that pair well with expected Cypher queries, and profile to troubleshoot/optimize queries. I highly recommend reading through our knowledge base article on cardinality in Cypher queries, and going through other Cypher articles in the knowledge base.
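Point 2 above can be sketched in a few lines of Python. This is a toy comparison, not a description of Neo4j's storage format: it assumes that relationships leaving a node can be reached by type without scanning the other types, and every name in it (`CONNECTED_TO`, `FRIEND`, `FOLLOWS`, the helper functions) is hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: (kind, target) pairs leaving one user node.
edges = [("FRIEND", "bob"), ("FRIEND", "carol")]
edges += [("FOLLOWS", f"celebrity-{i}") for i in range(1000)]

# Model A: one generic relationship type, with the kind stored as a property.
generic = [
    {"type": "CONNECTED_TO", "kind": kind, "target": target}
    for kind, target in edges
]

# Model B: the kind is the relationship type, so relationships group by type.
by_type = defaultdict(list)
for kind, target in edges:
    by_type[kind].append(target)


def friends_generic(relationships):
    """Expand everything, then filter on the property."""
    examined = 0
    found = []
    for rel in relationships:
        examined += 1
        if rel["kind"] == "FRIEND":
            found.append(rel["target"])
    return found, examined


def friends_typed(grouped):
    """Go straight to the relationships of the wanted type."""
    found = list(grouped.get("FRIEND", ()))
    return found, len(found)


generic_found, generic_examined = friends_generic(generic)
typed_found, typed_examined = friends_typed(by_type)

print(generic_found == typed_found, generic_examined, typed_examined)
```

Both models return the same two friends, but the generic one examines all 1,002 relationships to find them while the typed one examines 2. How close a real query gets to the second number is something to confirm by profiling it.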

While graph dbs have great power in that you have lots of flexibility in how you connect and model your data, this flexibility means there are many more options for how to model your data, and some modeling decisions will be less straightforward than others. With experience you can begin to get a feel for modeling smells, which should push you toward refactoring your model, and in some cases it will require query profiling to reveal weaknesses in your model.

Some other resources to help you out here:

Links in our top-level post in our Modeling section of the community site
The Modeling section of the community site
Modeling Designs blog post (links up with Max De Marzi's blog posts)
Data modeling pitfalls


Awesome! Thank you for the information, very helpful. I'll read through the docs provided.

I've used Neo4j on two of my own startups and never looked back:

  1. A social network for musicians and music lovers, with 400,000 users and growing. All queries return in under 30ms except for the recommendation engine, which takes ~2 seconds. That is still acceptable, and I have a future solution in mind that will bring it back to near real-time.

  2. Recently have begun to assist a joint venture company that is undertaking digital transformation in sport - modelling the sporting community and associated processes as a network.

Besides the excellent information provided by Andrew Bowman:

Thanks @jasperblues, great input. Glad to hear that you have been happy with your choice. I do have questions on operating the db, but I'll look for the right place to ask them.

I'm curious about your model and query patterns for your social network, as what I'm wanting to do is very similar, though in a very different domain. I assume there is some sort of an "Activity Feed" within your network. Do you mind sharing a little bit about what worked from a model perspective? I can imagine that you could have user <-friend-> user, user-belongs->group, user-shares->post, post-to->group, user-shares->comment, comment-belongsto->post or something. Do you then just query for all posts up to 2 hops away, ordered by creation date desc? What about comments on the post that you want to surface as well? It seems like that starts to become a non-scalable solution, because if you want to sort by date, you end up looking at potentially a ton of nodes. Thinking while I type, perhaps that means you put the date on the relationship instead of the node to avoid one extra level of traversal.

Or perhaps I just have the model wrong? Any thoughts you could share would be very helpful.

Thanks in advance!
-kssea


It depends on your domain. In our case we have the concept of a (temporal) inbox, and we compute what goes into it using a managed extension, based on the kinds of characteristics that you describe. Then the ultimate read query is just (:User)-[:HAS_ITEM]-(:InBox)<-[:AUTHOR_OF]-(:User), which is very cheap.

I plan to do the same approach for recommendations.

Btw, did you see? https://twitter.com/emileifrem/status/1067789758401662976

Ah, so a denormalized activity store of sorts. I presume you have some sort of activity ingestion pipeline, where when a user shares something, it gets written to the primary db (the same graph db or elsewhere?) and then gets post-processed and attached to all friends of that user that haven't muted them (or something). Is that it? And this managed extension you speak of, is that a procedure or something else? How does it come into play?

And thanks for sharing the tweet. Really impressive! Just read the interview with the dev involved, very cool.

Sorry @kssea I didn't get a notification about your reply. Yes that's it - you got it!
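For anyone reading along, the approach confirmed here (do the work when something is shared, so that reading the feed is a cheap lookup) can be sketched as follows. This is plain Python with dictionaries standing in for the graph, not the managed extension itself, and it assumes items are shared in timestamp order; `friends`, `muted`, `share`, and `read_inbox` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical toy data.
friends = {
    "ann": {"ben", "cat", "dan"},
    "ben": {"ann", "dan"},
}
muted = {"cat": {"ann"}}  # cat has muted ann

# user -> list of (timestamp, author, text), oldest first
inbox = defaultdict(list)


def share(author, text, timestamp):
    """Write path: attach the item to each friend's inbox unless they muted the author."""
    for friend in friends.get(author, ()):
        if author in muted.get(friend, ()):
            continue
        inbox[friend].append((timestamp, author, text))


def read_inbox(user, limit=10):
    """Read path: newest items first, with no traversal of friends or their posts."""
    items = inbox.get(user, [])
    return list(reversed(items[-limit:]))


share("ann", "new demo", 1)
share("ben", "gig tonight", 2)

print(read_inbox("dan"))
print(read_inbox("cat"))
```

The trade-off is the usual one for this kind of denormalization: each share costs one write per recipient, in exchange for a read that only touches the reader's own inbox.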

Managed extensions: these are processes that run inside Neo4j and can be configured, e.g. to run when the CPU is idle. Unfortunately the documentation link seems to have broken:

https://neo4j.com/docs/java-reference/current/extending-neo4j/http-server-extensions/