2nd degree relations lookup seems very slow. Can I improve it somehow?

Hi,

I'm using Neo4j to store relations between 3 entities:
USER, PROFILE, COMPANY.

Here is my schema:

(:USER)-[:CONNECTED_WITH]->(:PROFILE)
(:PROFILE)-[:WORKED_AT]->(:COMPANY)
(:PROFILE)-[:CURRENTLY_WORKS_AT]->(:COMPANY)

1 USER may have many CONNECTED_WITH relations with different PROFILE nodes.
1 PROFILE may have many WORKED_AT relations with different COMPANY nodes.
1 PROFILE may have only 1 CURRENTLY_WORKS_AT relation with any company.
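(For what it's worth, a rough way to sanity-check that last constraint - just a sketch using a Neo4j 5 COUNT subquery, and it scans every PROFILE node, so it's not cheap - would be:)

MATCH (p:PROFILE)
WHERE COUNT { (p)-[:CURRENTLY_WORKS_AT]->(:COMPANY) } > 1
RETURN count(p) AS profilesWithMultipleCurrentCompanies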

Each node has an "id" attribute and an associated index.
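For reference, creating such range indexes in Cypher looks roughly like this (the index names are just examples):

CREATE INDEX user_id IF NOT EXISTS FOR (n:USER) ON (n.id);
CREATE INDEX profile_id IF NOT EXISTS FOR (n:PROFILE) ON (n.id);
CREATE INDEX company_id IF NOT EXISTS FOR (n:COMPANY) ON (n.id);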

My goal is to extract the list of user IDs (preferably unique) connected to profiles that currently work at a given company.

This is my query:

MATCH (:COMPANY {id: "<id>"})<-[:CURRENTLY_WORKS_AT]-(:PROFILE)<-[:CONNECTED_WITH]-(u:USER)
RETURN DISTINCT u.id

The query works fine when the COMPANY and PROFILE nodes have a low degree. But some nodes in my graph have lots of relations (like a large COMPANY), which forces Neo4j to scan its storage (I can see it on my IOPS charts) and makes those queries very slow.

Here is a profile of an example query, where a company has over 18k CURRENTLY_WORKS_AT relations, which expand into 126k relations and ultimately lead to ~18k distinct USER nodes. Without profiling, this query took ~30s to complete the first time, before that part of the graph got pulled into memory.

I have cases of COMPANY nodes with hundreds of thousands of CURRENTLY_WORKS_AT relations, where this query doesn't complete in any reasonable time.

My DB has ~700,000,000 nodes and ~1,700,000,000 relationships.
Total size of data and native indexes in all databases: ~256 GB.
I've got Neo4j 5.25.1 CE + APOC + DozerDB 5.25.1.0-alpha.1.
It's running on an AWS EC2 m7g.4xlarge instance (16 vCPUs + 64 GB RAM).

I have a default memory config.

My main question is: can I do anything here to improve things, besides moving to a machine with enough RAM to keep the entire DB in memory?
I'm new to Neo4j and Cypher, so perhaps I have missed something fairly obvious. Or maybe there's simply nothing to be done with a graph that's this wide rather than tall?

Any help will be much appreciated :slight_smile:

Is that database available for download?

Have you implemented indexes?

Sorry, no, this DB is not available for download. But I can provide more info if needed :slight_smile:

As for indices, yes, I have range indices on the "id" property of all nodes. As far as I understand, indices are used by Neo4j mostly to find the starting point of a pattern. In the profile I have shared, you can see that the starting COMPANY node is indeed located using an index.
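(To double-check their type and state, something like this should do - column names as reported by Neo4j 5's SHOW INDEXES:)

SHOW INDEXES YIELD name, type, state, labelsOrTypes, properties
WHERE 'id' IN properties
RETURN name, type, state, labelsOrTypes, properties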

Understood ...

I noticed you appear to be using the Community Edition of Neo4j with an alpha version of an open-source plugin ... so it's possible your problem lies in that plugin?

You probably should test without it ...

I find that unlikely. AFAIK, all DozerDB does is enable certain enterprise features, such as support for multiple DBs and some security options. I don't think it messes with any internals of the query engine.

Having said that, if we're looking into potential plugin effects, does that mean Neo4j itself should be able to handle my queries quickly, with all the data it currently contains, high-degree nodes and all?

And, to be clear, when dealing with simple lookups by ID, or matching shorter patterns, or even the longer ones but with nodes of a smaller degree, the performance is very good - I'm getting the first results within milliseconds. The performance problem shows up only when the DB has to load a large amount of data from the disks - which is expected. I'm just trying to get some advice on how to avoid or reduce the interaction with the disks :)

I am not sure about the internals of the plugin ... nor about your configuration overall ... nor do I have the possibility of replicating your environment.

If you have pinpointed that the problem is disk reads / swapping, then fine-tuning the caching parameters is a good starting point.

This is sort of a "you're the only one who can hack on this" situation.

What about the profile? Anything suspicious there?
I'm a bit surprised by this bit:

We're following a CONNECTED_WITH relation from PROFILE nodes to USER nodes. We can see that this operation fetched 25k rows but generated 126k db hits. It looks like Neo4j retrieved all relations of the PROFILE nodes (most of which are WORKED_AT, which we're not interested in here). Is that what's happening? If so, is there a way to avoid it?
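(In case it helps, here's a rough way to see how the degree of those PROFILE nodes breaks down per relationship type - just a sketch, reusing the same "<id>" placeholder; COUNT subqueries with only a relationship type and direction are typically cheap, since they can be answered from each node's stored degree counts:)

MATCH (c:COMPANY {id: "<id>"})<-[:CURRENTLY_WORKS_AT]-(p:PROFILE)
WITH p,
     COUNT { (p)<-[:CONNECTED_WITH]-() } AS connectedWith,
     COUNT { (p)-[:WORKED_AT]->() }      AS workedAt
RETURN count(p) AS profiles, sum(connectedWith) AS totalConnectedWith, sum(workedAt) AS totalWorkedAt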

As for the config, I have set the memory configuration according to the neo4j-admin server memory-recommendation tool (as a reminder, the host system has 64 GB RAM available):

server.memory.heap.initial_size=23700m
server.memory.heap.max_size=23700m
server.memory.pagecache.size=27500m

Everything else is Neo4j defaults.

What happens with the plan if you start separating and inverting the query?

MATCH (C:COMPANY {id: "<id>"})
MATCH (u:USER)-[:CONNECTED_WITH]->(:PROFILE)-[:CURRENTLY_WORKS_AT]->(C)
RETURN DISTINCT u.id

The execution is exactly the same as with the original query. I'm getting the same profile.

Then that means it is optimal?

Your profile says that it is getting all cache hits - meaning they are in memory (or in virtual memory ... I can't tell - but they are not read from disk).

126K hits on the indices/keys amounts to megabytes of RAM, not GB - and if your total DB size is 256 GB, then even in the worst-case scenario, after everything settles, it should only take a handful of swaps?

The cache hits are there because it wasn't the first time I ran this query when I generated the profile. The "cold" time of this query was very long. This is where the disk reads occurred, of course :slight_smile: Whenever we're reading from the cache, everything is blazing fast.

As far as I understand, Neo4j starts with an empty cache and only populates it while running queries. This means that indices are not kept in memory until they're used, and then only partially, right? And the pagecache stores not only indices - it shares that memory with the rest of the graph data, including nodes and relationships. Unfortunately, my graph is too large to fit in memory. That's why I cannot rely on the pagecache to speed things up, as there will always be some queries that work on data that's not cached (either because it's the first access, or because the data has been evicted).

That's why I'm wondering why there were so many DB hits when following relations of a specific type, even though they lead to a much smaller number of end nodes. Each of those DB hits is a potential disk read, and those add up to very long times.

Perhaps this is just the reality: I can't work on such a large graph in real time (at least with certain queries) without having it all in memory. I can accept that. But I opened this topic to first rule out other options and mistakes on my part.

@mike8

Maybe semantics, but that's not entirely true. Neo4j always keeps track of what is currently in the pagecache, and if you then restart, the data that was in the pagecache prior to the restart is automatically reloaded into the pagecache after the restart. However, this is not a blocking process for Neo4j startup.