2nd degree relations lookup seems very slow. Can I improve it somehow?

Hi,

I'm using Neo4j to store relations between 3 entities:
USER, PROFILE, COMPANY.

Here is my schema:

(:USER)-[:CONNECTED_WITH]->(:PROFILE)
(:PROFILE)-[:WORKED_AT]->(:COMPANY)
(:PROFILE)-[:CURRENTLY_WORKS_AT]->(:COMPANY)

1 USER may have many CONNECTED_WITH relations with different PROFILE nodes.
1 PROFILE may have many WORKED_AT relations with different COMPANY nodes.
1 PROFILE may have only 1 CURRENTLY_WORKS_AT relation with any company.
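(For what it's worth, a rough way to sanity-check that last constraint - just a sketch using a Neo4j 5 COUNT subquery, and it scans every PROFILE node, so it's not cheap - would be:)

MATCH (p:PROFILE)
WHERE COUNT { (p)-[:CURRENTLY_WORKS_AT]->(:COMPANY) } > 1
RETURN count(p) AS profilesWithMultipleCurrentCompanies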

Each node has an "id" attribute and an associated index.
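For reference, creating such range indexes in Cypher looks roughly like this (the index names are just examples):

CREATE INDEX user_id IF NOT EXISTS FOR (n:USER) ON (n.id);
CREATE INDEX profile_id IF NOT EXISTS FOR (n:PROFILE) ON (n.id);
CREATE INDEX company_id IF NOT EXISTS FOR (n:COMPANY) ON (n.id);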

My goal is to extract the list of user IDs (preferably unique) connected to profiles that currently work at a given company.

This is my query:

MATCH (:COMPANY {id: "<id>"})<-[:CURRENTLY_WORKS_AT]-(:PROFILE)<-[:CONNECTED_WITH]-(u:USER)
RETURN DISTINCT u.id

The query works fine when the COMPANY and PROFILE nodes have a low degree. But some nodes in my graph have lots of relations (like a large COMPANY), which forces Neo4j to scan its storage (I can see it on my IOPS charts) and makes those queries very slow.

Here is a profile of an example query, where a company has over 18k CURRENTLY_WORKS_AT relations, which expand into 126k relations and ultimately lead to ~18k distinct USER nodes. Without profiling, this query took ~30s to complete the first time, before that part of the graph got pulled into memory.

I have cases of COMPANY nodes with hundreds of thousands of CURRENTLY_WORKS_AT relations, where this query doesn't complete in any reasonable time.

My DB has ~700,000,000 nodes and ~1,700,000,000 relationships.
Total size of data and native indexes in all databases: ~256 GB.
I've got Neo4j 5.25.1 CE + APOC + DozerDB 5.25.1.0-alpha.1.
It's running on an AWS EC2 m7g.4xlarge instance (16 vCPUs + 64 GB RAM).

I have a default memory config.

My main question is: can I do anything here to improve things, besides moving to a machine with enough RAM to keep the entire DB in memory?
I'm new to Neo4j and Cypher, so perhaps I have missed something fairly obvious. Or maybe there's simply nothing to be done with a graph that's this wide rather than tall?

Any help will be much appreciated :slight_smile:

Is that database available for download?

Have you implemented indexes?

Sorry, no, this DB is not available for download. But I can provide more info if needed :slight_smile:

As for indices, yes, I have range indices on the "id" property of all nodes. As far as I understand, indices are used by Neo4j mostly to find the starting point of a pattern. In the profile I have shared, you can see that the starting COMPANY node is indeed located using an index.
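(To double-check their type and state, something like this should do - column names as reported by Neo4j 5's SHOW INDEXES:)

SHOW INDEXES YIELD name, type, state, labelsOrTypes, properties
WHERE 'id' IN properties
RETURN name, type, state, labelsOrTypes, properties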

Understood ...

I noticed you appear to be using the Community Edition of Neo4j with an alpha version of an open-source plugin ... so it's possible your problem lies in that plugin?

You probably should test without it ...

I find that unlikely. AFAIK, all DozerDB does is enable certain enterprise features, such as support for multiple DBs and some security options. I don't think it messes with any internals of the query engine.

Having said that, if we're looking into potential plugin effects, does that mean Neo4j itself should be able to handle my queries quickly, with all the data it currently contains, high-degree nodes and all?

And, to be clear, when dealing with simple lookups by ID, or matching shorter patterns, or even the longer ones but with nodes of a smaller degree, the performance is very good - I'm getting the first results within milliseconds. The performance problem shows up only when the DB has to load a large amount of data from the disks - which is expected. I'm just trying to get some advice on how to avoid or reduce the interaction with the disks :)

I am not sure about the internals of the plugin ... nor about your configuration overall ... nor do I have the possibility of replicating your environment.

If you have pinpointed that the problem is disk reads / swapping, then fine-tuning the caching parameters is a good starting point.

This is sort of a "you're the only one who can hack on this" situation.

What about the profile? Anything suspicious there?
I'm a bit surprised by this bit:

We're following a CONNECTED_WITH relation from PROFILE nodes to USER nodes. We can see that this operation fetched 25k rows but generated 126k db hits. It looks like Neo4j retrieved all relations of the PROFILE nodes (most of which are WORKED_AT, which we're not interested in here). Is that what's happening? If so, is there a way to avoid it?
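(In case it helps, here's a rough way to see how the degree of those PROFILE nodes breaks down per relationship type - just a sketch, reusing the same "<id>" placeholder; COUNT subqueries with only a relationship type and direction are typically cheap, since they can be answered from each node's stored degree counts:)

MATCH (c:COMPANY {id: "<id>"})<-[:CURRENTLY_WORKS_AT]-(p:PROFILE)
WITH p,
     COUNT { (p)<-[:CONNECTED_WITH]-() } AS connectedWith,
     COUNT { (p)-[:WORKED_AT]->() }      AS workedAt
RETURN count(p) AS profiles, sum(connectedWith) AS totalConnectedWith, sum(workedAt) AS totalWorkedAt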

As for the config, I have set the memory configuration according to the neo4j-admin server memory-recommendation tool (as a reminder, the host system has 64 GB RAM available):

server.memory.heap.initial_size=23700m
server.memory.heap.max_size=23700m
server.memory.pagecache.size=27500m

Everything else is Neo4j defaults.

What happens with the plan if you start separating and inverting the query?

MATCH (C:COMPANY {id: "<id>"})
MATCH (u:USER)-[:CONNECTED_WITH]->(:PROFILE)-[:CURRENTLY_WORKS_AT]->(C)
RETURN DISTINCT u.id

The execution is exactly the same as with the original query. I'm getting the same profile.

Then that means it is optimal?

Your profile says that it is getting all cache hits - meaning they are in memory (or in virtual memory ... I can't tell - but they are not read from disk).

126K hits on the indices/keys amounts to megabytes of RAM, not GB - and if your total DB size is 256 GB, then even in the worst-case scenario, after everything settles, it should only take a handful of swaps?

The cache hits are there because it wasn't the first time I ran this query when I generated the profile. The "cold" time of this query was very long. This is where the disk reads occurred, of course :slight_smile: Whenever we're reading from the cache, everything is blazing fast.

As far as I understand, Neo4j starts with an empty cache and only populates it while running queries. This means that indices are not kept in memory until they're used, and then only partially, right? And the pagecache stores not only indices - it shares that memory with the rest of the graph data, including nodes and relationships. Unfortunately, my graph is too large to fit in memory. That's why I cannot rely on the pagecache to speed things up, as there will always be some queries that work on data that's not cached (either because it's the first access, or because the data has been evicted).

That's why I'm wondering why there were so many DB hits when following relations of a specific type, even though they lead to a much smaller number of end nodes. Each of those DB hits is a potential disk read, and those add up to very long times.

Perhaps this is just the reality: I can't work on such a large graph in real time (at least with certain queries) without having it all in memory. I can accept that. But I opened this topic to first rule out other options and mistakes on my part.

@mike8

Maybe semantics, but that's not entirely true. Neo4j always keeps track of what is currently in the pagecache, and if you then restart, the data that was in the pagecache prior to the restart is automatically reloaded into the pagecache after the restart. However, this is not a blocking process for Neo4j startup.