Questions about my query model

victor · February 27, 2025, 4:37am

Hi folks, I am building a Graph API where I can get nodes and edges subsequently in the same request.

I am having a bit of trouble trying to optimize this query (please ignore literal values as I replace them with parameters on the application). I am getting out of ideas on how to make them better and use less memory (main issue).

Basically my query get lists of a specific type o node from another one, then orders by insertion_time and then I use subqueries to get some more inner nodes. We have a lot of types of queries, but I will examplify with one of them.

Basically I am using a combination of COMBINE and DISTINCT and then limiting the nodes so I can prepare them for the following query, because it seems better to filter them before going to the sub queries.

Here is the query:

	MATCH (root_node:Entity:Business {source_id:"0101-0003-0006-00000001"})-[lvl_0_main_relationship_can_read:CAN_READ]-> (lvl_0:Entity:Membership)
		  WITH lvl_0,
		  root_node
	ORDER BY lvl_0.insertion_time DESC
	WHERE (lvl_0.status IN [ "ACTIVE",
					  "DISABLED",
					  "PENDING" ] )
		  AND (lvl_0.insertion_time <= "2025-02-27T03:37:12.1748779Z")
		  WITH COLLECT(DISTINCT(lvl_0))[0..20] as lvl_0_collected,
		  root_node
	UNWIND lvl_0_collected AS lvl_0
	CALL {
		  WITH lvl_0,
				root_node

		  OPTIONAL MATCH (root_node)-[:CAN_READ]->(lvl_0)-[lvl_0_child_0_main_relationship_can_read:CAN_READ]-> (lvl_0_child_0:Entity:AssetGroup)
		  WITH lvl_0_child_0,
				lvl_0,
				root_node
		  ORDER BY lvl_0_child_0.insertion_time desc
		  WHERE (lvl_0_child_0.status IN [ "ACTIVE" ] )
		  WITH collect(distinct(lvl_0_child_0))[0..20] as lvl_0_child_0_collected,
				lvl_0,
				root_node
		  UNWIND lvl_0_child_0_collected as lvl_0_child_0
		  CALL {
		  WITH lvl_0_child_0

				OPTIONAL MATCH (lvl_0_child_0)-[:BELONGS_TO]-> (lvl_0_child_0_child_0:Tenant)
		  WITH lvl_0_child_0_child_0,
					  lvl_0_child_0
		  WITH collect(distinct(lvl_0_child_0_child_0))[0..1] as lvl_0_child_0_child_0_collected,
					  lvl_0_child_0
				UNWIND lvl_0_child_0_child_0_collected as lvl_0_child_0_child_0
				RETURN COLLECT(DISTINCT({node: lvl_0_child_0_child_0,
										children:[]})) as lvl_0_child_0_child_0_collected }
		  RETURN COLLECT(DISTINCT({node: lvl_0_child_0,
								  children:[{edge: "tenants",
											  nodes: lvl_0_child_0_child_0_collected}]})) as lvl_0_child_0_collected 
	}
	CALL {
		  WITH lvl_0

		  OPTIONAL MATCH (lvl_0)-[:BELONGS_TO]-> (lvl_0_child_1:Tenant)
		  WITH DISTINCT lvl_0_child_1
		  LIMIT 1
		  RETURN COLLECT({node: lvl_0_child_1,
							children:[]}) as lvl_0_child_1_collected 
	}
	RETURN COLLECT({node: lvl_0,
					  children: [{edge: 'edge1',
								  nodes: lvl_0_child_0_collected},
							{edge: 'tenants',
								  nodes: lvl_0_child_1_collected}] })

What I would like some help is to understand if the issue is the "COLLECT/DISTINCT" that I am using, and if there is some better way to sort the results, filter and then run the subquery. This pattern seems to do the least dbhits I could find, but I think that it is not the most memory-optimized.

My db has a pattern of indexes for {Label} on source_id and for Entity on insertion_time.

But I think the main issue is the too long profile that it generates. I Would appreciate some ideas to make it more optimal, maintaining the response. I am running out of ideias, so I would really appreciate any insight.

If there is any missing info, please let me know

Explain:

joshcornejo · February 27, 2025, 8:22am

Hi Victor,

At first glance, you are creating a query, then using the output converted as a list, and passed as a parameter to the next query, this parameter goes back into the following query (rather than being a continuation of the query), so that is probably why your EXPLAIN looks so long.

Why not start with consolidating the queries?

MATCH (root_node:Entity:Business { source_id:"0101-0003-0006-00000001" })-[:CAN_READ]->(lvl_0:Entity:Membership)-[:CAN_READ*0..1]->(lvl_0_child_0:Entity:AssetGroup)
WITH lvl_0, root_node
  WHERE (lvl_0.status IN [ "ACTIVE", "DISABLED", "PENDING" ] )
  AND (lvl_0_child_0.status IN ["ACTIVE"])
  AND (lvl_0.insertion_time <= "2025-02-27T03:37:12.1748779Z")

You can also probably leave the "order by" to the end (I use epoch's - long - and do the sort in the consumer of the query, but that works in my case as the results are usually a handful)

Also, might it be useful if you show a graph?

victor · February 27, 2025, 11:09am

Hi Josh! Thank you so much for your time trying to help me.

I try to make this chained queries so that I can get the chained response, here is an example os the query response:

I may want to make the graph call:

https://{host}/{version}/{business_id}/memberships?fields=asset_groups?limit=4

The response looks very similar to Meta's Graph api, which is where I took my inspiration from.

This would return the following graph:

you can see that I returned 4 "Memberships" Nodes and its respectived connected "AssetGroup" nodes.

Here is an example json: JSON Editor Online: edit JSON, format JSON, query JSON

What do you mean in your example of "consolidating the queries"? You mean making multiples cypher calls instead of a single one?

It is interesting doing the order by in the consumer, but I think in my case would waste too much processing and get a lot of unsused data. I was trying to get the complete chained nodes on neo4j so that I could continue my application's request.

Also. I do the order by and filter to the list before going to the subquery so that I can only process the nodes that are going to return. Let's say there are 100 Membership nodes in the specified Business Node, I would only get 4 of them, so that I can get subnodes only from each of them, it was the way I found to saved up dbhits. Please, let me know if this strategy makes sense. But I understand that the order by is one of the main reasons why it takes up so much processing, but I was not able to change it yet, because as I said, my application returns the nodes always in this order (most recent).

joshcornejo · February 27, 2025, 11:14am

I mean doing less MATCH statements and more refinement parts.

but you already mentioned 20M nodes? so you are getting unused data anyway - depends on how many clients you have, and the rate of change of your nodes?

Yes - that makes sense, but you seem to have a few order by hence why I recommended doing it once in the client

joshcornejo · February 27, 2025, 11:16am

And re:GraphQL, here is something I sent to someone else on a different thread somewhere else (courtesy of ChatGPT):

GraphQL offers flexibility and efficiency in API design, but it comes with scalability challenges, especially as applications grow. Some key scalability problems include:

1. N+1 Query Problem

When fetching nested data, GraphQL might execute multiple small queries instead of an optimized batch request.
Example: Fetching a list of users and their posts might trigger separate database queries for each user, leading to excessive database hits.

Solution: Use data loaders or batch queries to reduce redundant requests.

2. Complex Queries Leading to Performance Issues

Clients can construct deeply nested or overly broad queries, resulting in slow database operations.
Unlike REST, where endpoints have fixed responses, GraphQL’s flexibility allows clients to request large amounts of data in a single query.

Solution: Implement query complexity analysis or depth limiting to prevent inefficient queries.

3. Caching Difficulties

REST APIs benefit from HTTP caching at the CDN level, but GraphQL queries are typically sent via POST requests, making caching harder.
Dynamic query structures mean responses are less predictable, reducing cache reusability.

Solution: Use persistent queries and response caching strategies at the resolver or database layer.

4. Authorization & Security Risks

More complex permission handling due to fine-grained data access.
Malicious users can craft expensive queries to overload the backend (e.g., requesting deeply nested data).

Solution: Implement query cost analysis, rate limiting, and field-level authorization.

5. Schema Complexity & Maintainability

Large applications with many types and resolvers can lead to difficult-to-maintain schemas.
Versioning can be tricky since GraphQL avoids traditional REST versioning practices.

Solution: Use schema stitching or federation for modular design, and deprecate fields instead of versioning.

6. Scaling GraphQL Servers

Since GraphQL servers handle complex query parsing and execution, they require more CPU/memory than simpler REST endpoints.
High-traffic applications need efficient resource management to prevent bottlenecks.

Solution: Use efficient resolver patterns, load balancing, and consider serverless GraphQL approaches.

Would you like specific recommendations for scaling GraphQL in your use case?

victor · February 27, 2025, 11:34am

Here in the example, I using a test database, but in production we have over 10M nodes. We have a considerated amount of consumers already, and they make these type os requests constantly, some of them are a bit more long, but it always looks like this example. The rate of change of the nodes happen constantly, usually the creation of more nodes. I have limited each subtree to have a maximum os nodes, so a single node would not have an infinite amount of sub-nodes of the same type connected to them.

Well, I guess I am, but the filtering I mentioned doing avoids from getting unrequired sub nodes from nodes that would not be in the response, but it does not seem to be a good trade, as I am looking at the metrics. The CPU is really fine, the main issue is the memory, this is the optimization I am look for

Please forgive me if I did not understant correctly, but what you are trying to recommend is the following:

FIrst, the the first-level nodes:

MATCH (root_node:Entity:Business {source_id:"0101-0003-0006-00000001"})-[lvl_0_main_relationship_can_read:CAN_READ]-> (lvl_0:Entity:Membership)
WHERE (lvl_0.status IN [ "ACTIVE",
				  "DISABLED",
				  "PENDING" ] )
	  AND (lvl_0.insertion_time <= 12123131231231
	  return lvl_0 order by lvl_0.insertion_time_unix_time

And Then, only after getting the response, do the subsequent calls separately on other cypher requests for the returned nodes, something like:

MATCH (root_node:Entity:Membership)-[:CAN_READ]-> (lvl_0:Entity:AssetGroup)
WHERE root_node.source_id in ["1","2","3"] AND (lvl_0.status IN [ "ACTIVE"] )
	  AND (lvl_0.insertion_time_unix_time<= 12123131231231
	  return lvl_0 order by lvl_0.insertion_time_unix_time

That could very well be a possibility. I was considering it, if I could not return the chained request in a optimal way, but I was considering if there would be a tweak that I could make to the request so that it would shorten the EXPLAIN. the main issue seems to be the order by indeed.

I understand that graphql really gives this kind of problem, but I am trying to optimize in a single query first. Maybe it is better to have more dbhits and consume less query memory than the other way around? I don't know, I am thinking about that now.

joshcornejo · February 27, 2025, 11:38am

In this message, I consolidated 2 of your original MATCH into one, you could also aim at doing the sort here once (as it will be in the filtered output of 2 subsets):

joshcornejo:

Why not start with consolidating the queries?

MATCH (root_node:Entity:Business { source_id:"0101-0003-0006-00000001" })-[:CAN_READ]->(lvl_0:Entity:Membership)-[:CAN_READ*0..1]->(lvl_0_child_0:Entity:AssetGroup)
WITH lvl_0, root_node
  WHERE (lvl_0.status IN [ "ACTIVE", "DISABLED", "PENDING" ] )
  AND (lvl_0_child_0.status IN ["ACTIVE"])
  AND (lvl_0.insertion_time <= "2025-02-27T03:37:12.1748779Z")

victor · February 27, 2025, 11:45am

Well indeed, sorry about that. And that's really a great ideia.

The "Membership" may not have connected "AssetGroup", so would it make sense to do something likes this then?

MATCH (root_node:Entity:Business { source_id:"0101-0003-0006-00000001" })-[:CAN_READ]->(lvl_0:Entity:Membership)
OPTIONAL MATCH (lvl_0)-[:CAN_READ]->(lvl_0_child_0:Entity:AssetGroup)
WITH lvl_0, root_node, lvl_0_child_0
  WHERE (lvl_0.status IN [ "ACTIVE", "DISABLED", "PENDING" ] )
  AND (lvl_0_child_0.status IN ["ACTIVE"])
  AND (lvl_0.insertion_time <= "2025-02-27T03:37:12.1748779Z")
 return {node: lvl_0,children: lvl_0_child_0} limit 20

Only because if I use the single MATCH, it would only return Memberships that have AssetGroups, it seems

victor · February 27, 2025, 11:52am

victor:

MATCH (root_node:Entity:Business { source_id:"0101-0003-0006-00000001" })-[:CAN_READ]->(lvl_0:Entity:Membership)
OPTIONAL MATCH (lvl_0)-[:CAN_READ]->(lvl_0_child_0:Entity:AssetGroup)
WITH lvl_0, root_node, lvl_0_child_0
  WHERE (lvl_0.status IN [ "ACTIVE", "DISABLED", "PENDING" ] )
  AND (lvl_0_child_0.status IN ["ACTIVE"])
  AND (lvl_0.insertion_time <= "2025-02-27T03:37:12.1748779Z")
 return {node: lvl_0,children: lvl_0_child_0} limit 20

Also, it seems that if the Membership has 2 AssetGroup, it is returning as 2 items in the response, instead of 1, with 2 "children" nodes. I am trying to re-build this query to see if there is a way to consolidate it in one node response. Also I forgot on the query to limit the subnode. I will share it here if I am able to build the correct query.

victor · February 27, 2025, 12:05pm

Don't know if this is too much already, but i would look something likes?

PROFILE MATCH (root_node:Entity:Business { source_id:"0101-0003-0006-00000001" })-[:CAN_READ]->(lvl_0:Entity:Membership)
OPTIONAL MATCH (lvl_0)-[:CAN_READ]->(lvl_0_child_0:Entity:AssetGroup)
WITH lvl_0, root_node, lvl_0_child_0
  WHERE (lvl_0.status IN [ "ACTIVE", "DISABLED", "PENDING" ] )
  AND (lvl_0.insertion_time <= "2025-02-27T03:37:12.1748779Z")
  AND (lvl_0_child_0.status IN ["ACTIVE"])
  WITH lvl_0_child_0, lvl_0, root_node
  order by lvl_0_child_0.insertion_time desc limit 20

 return {node: lvl_0,children: lvl_0_child_0} order by lvl_0.insertion_time limit 20

It was the way I found to return the same response, really downsizing the EXPLAIN.

I used the OPTIONAL match because I want event the Memberships that do not have AssetGroups

and I sorted and limited the nested AssetGroup for each Membership, so that i could return them already consolidated.

The problem is that it is limiting the memberships as well, it is not 100% correct yet

joshcornejo · February 27, 2025, 1:08pm

That's why you can have *0..1, did you try it ?

joshcornejo · February 27, 2025, 1:16pm

I don't have your graph, so you might have to go trial-and-error:

PROFILE 
MATCH (root_node:Business { source_id:"0101-0003-0006-00000001" })
  -[:CAN_READ*0..1]->(lvl_0:Membership) 
  -[:CAN_READ*0..1]->(lvl_0_child_0:AssetGroup)
WHERE (lvl_0.status IN [ "ACTIVE", "DISABLED", "PENDING" ] )
  AND (lvl_0.insertion_time <= "2025-02-27T03:37:12.1748779Z")
  AND (lvl_0_child_0.status IN ["ACTIVE"])
  ORDER BY lvl_0.insertion_time, lvl_0_child_0.insertion_time
RETURN {node: lvl_0, children: lvl_0_child_0} limit 20

victor · February 27, 2025, 2:38pm

Hi Josh! I've tried with *0..1, but did not work.

But I am trying to make it work so I can update here with a correct query.

Thanks for the feedback and the ideias, they were precious!

I will post here soon a working query, if I try to make it work

Thank you very much!

victor · March 6, 2025, 5:07pm

Hi Josh!

I have optimized the query but I maintained the structure. Here it is:

PROFILE

MATCH (root_node:Entity:Business {source_id:"0101-0003-0006-00000001"})-[lvl_0_main_relationship_can_read:CAN_READ]-> (lvl_0:Entity:Membership)
WHERE (lvl_0.status IN [ "ACTIVE",
                  "DISABLED",
                  "PENDING" ] )
      AND (lvl_0.insertion_time <= $date)
      WITH DISTINCT lvl_0,
      root_node
ORDER BY lvl_0.insertion_unix_time DESC SKIP 0
LIMIT 20
CALL {
      WITH lvl_0,
            root_node

      OPTIONAL MATCH (root_node)-[:CAN_READ]->(lvl_0)-[lvl_0_child_0_main_relationship_can_read:CAN_READ]-> (lvl_0_child_0:Entity:AssetGroup)
      WHERE (lvl_0_child_0.status IN [ "ACTIVE" ] )
      WITH DISTINCT lvl_0_child_0,
            lvl_0,
            root_node
      ORDER BY lvl_0_child_0.insertion_unix_time desc
      LIMIT 20
      CALL {
      WITH lvl_0_child_0,
                  lvl_0

            OPTIONAL MATCH (lvl_0)-[:CAN_READ]->(lvl_0_child_0)-[lvl_0_child_0_child_0_main_relationship_can_read:CAN_READ]-> (lvl_0_child_0_child_0:Entity:FacebookAdAccount)
            WHERE (lvl_0_child_0_child_0.status IN [ "ACTIVE",
                              "DISABLED",
                              "UNVERIFIED" ] )
      AND lvl_0_child_0_child_0.name
            CONTAINS "27"
      WITH DISTINCT lvl_0_child_0_child_0,
                  lvl_0_child_0,
                  lvl_0
            ORDER BY lvl_0_child_0_child_0.insertion_unix_time desc
            LIMIT 20
            CALL {
      WITH lvl_0_child_0_child_0

                  OPTIONAL MATCH (lvl_0_child_0_child_0)-[:BELONGS_TO]-> (lvl_0_child_0_child_0_child_0:Tenant)
      WITH lvl_0_child_0_child_0_child_0,
                        lvl_0_child_0_child_0
                  LIMIT 1
                  RETURN COLLECT({node: lvl_0_child_0_child_0_child_0,
                                    children:[]}) as lvl_0_child_0_child_0_child_0_collected }
            RETURN COLLECT({node: lvl_0_child_0_child_0,
                              children:[{edge: "tenants",
                                          nodes: lvl_0_child_0_child_0_child_0_collected}]}) as lvl_0_child_0_child_0_collected }
      CALL {
      WITH lvl_0_child_0

            OPTIONAL MATCH (lvl_0_child_0)-[:BELONGS_TO]-> (lvl_0_child_0_child_1:Tenant)
      WITH lvl_0_child_0_child_1,
                  lvl_0_child_0
            LIMIT 1
            RETURN COLLECT({node: lvl_0_child_0_child_1,
                              children:[]}) as lvl_0_child_0_child_1_collected }
      RETURN COLLECT({node: lvl_0_child_0,
                        children:[{edge: "facebook_ad_accounts",
                                    nodes: lvl_0_child_0_child_0_collected},
                              {edge: "tenants",
                                    nodes: lvl_0_child_0_child_1_collected}]}) as lvl_0_child_0_collected 
}
CALL {
      WITH lvl_0

      OPTIONAL MATCH (lvl_0)-[:BELONGS_TO]-> (lvl_0_child_1:Tenant)
      WITH lvl_0_child_1,
            lvl_0
      LIMIT 1
      RETURN COLLECT({node: lvl_0_child_1,
                        children:[]}) as lvl_0_child_1_collected 
}
RETURN {
            node: lvl_0,
            children: [{edge: 'edge1',
                        nodes: lvl_0_child_0_collected},
                  {edge: 'edge2',
                        nodes: lvl_0_child_1_collected}] 
}

Unfortunately I could not create the same response format using a single query.

But removing the "collect(distinct(...))" from the query helped decrease the memory. Now the nodes are limited before going to the subquery, and it only processes the filtered nodes.

I still am not happy with this, I really wanted to make it work like your suggestion, but I failed so far.

Anyway, just sharing the results

EDIT:

I also removed the DISTINCT and OPTIONAL MATCH where is safe to do it:

PROFILE

MATCH (root_node:Entity:Business {source_id:"0101-0003-0006-00000001"})-[lvl_0_main_relationship_can_read:CAN_READ]-> (lvl_0:Entity:Membership)
WHERE (lvl_0.status IN [ "ACTIVE",
                  "DISABLED",
                  "PENDING" ] )
      AND (lvl_0.insertion_time <= $date)
      WITH DISTINCT lvl_0,
      root_node
ORDER BY lvl_0.insertion_unix_time DESC SKIP 0
LIMIT 20
CALL {
      WITH lvl_0,
            root_node

      MATCH (root_node)-[:CAN_READ]->(lvl_0)-[lvl_0_child_0_main_relationship_can_read:CAN_READ]-> (lvl_0_child_0:Entity:AssetGroup)
      WHERE (lvl_0_child_0.status IN [ "ACTIVE" ] )
      WITH lvl_0_child_0,
            lvl_0,
            root_node
      ORDER BY lvl_0_child_0.insertion_unix_time desc
      LIMIT 20
      CALL {
      WITH lvl_0_child_0,
                  lvl_0

            MATCH (lvl_0)-[:CAN_READ]->(lvl_0_child_0)-[lvl_0_child_0_child_0_main_relationship_can_read:CAN_READ]-> (lvl_0_child_0_child_0:Entity:FacebookAdAccount)
            WHERE (lvl_0_child_0_child_0.status IN [ "ACTIVE",
                              "DISABLED",
                              "UNVERIFIED" ] )
      AND lvl_0_child_0_child_0.name
            CONTAINS "27"
      WITH lvl_0_child_0_child_0,
                  lvl_0_child_0,
                  lvl_0
            ORDER BY lvl_0_child_0_child_0.insertion_unix_time desc
            LIMIT 20
            CALL {
      WITH lvl_0_child_0_child_0

                  MATCH (lvl_0_child_0_child_0)-[:BELONGS_TO]-> (lvl_0_child_0_child_0_child_0:Tenant)
      WITH lvl_0_child_0_child_0_child_0,
                        lvl_0_child_0_child_0
                  LIMIT 1
                  RETURN COLLECT({node: lvl_0_child_0_child_0_child_0,
                                    children:[]}) as lvl_0_child_0_child_0_child_0_collected }
            RETURN COLLECT({node: lvl_0_child_0_child_0,
                              children:[{edge: "tenants",
                                          nodes: lvl_0_child_0_child_0_child_0_collected}]}) as lvl_0_child_0_child_0_collected }
      CALL {
      WITH lvl_0_child_0

            MATCH (lvl_0_child_0)-[:BELONGS_TO]-> (lvl_0_child_0_child_1:Tenant)
      WITH lvl_0_child_0_child_1,
                  lvl_0_child_0
            LIMIT 1
            RETURN COLLECT({node: lvl_0_child_0_child_1,
                              children:[]}) as lvl_0_child_0_child_1_collected }
      RETURN COLLECT({node: lvl_0_child_0,
                        children:[{edge: "facebook_ad_accounts",
                                    nodes: lvl_0_child_0_child_0_collected},
                              {edge: "tenants",
                                    nodes: lvl_0_child_0_child_1_collected}]}) as lvl_0_child_0_collected 
}
CALL {
      WITH lvl_0

      MATCH (lvl_0)-[:BELONGS_TO]-> (lvl_0_child_1:Tenant)
      WITH lvl_0_child_1,
            lvl_0
      LIMIT 1
      RETURN COLLECT({node: lvl_0_child_1,
                        children:[]}) as lvl_0_child_1_collected 
}
RETURN {
            node: lvl_0,
            children: [{edge: 'edge1',
                        nodes: lvl_0_child_0_collected},
                  {edge: 'edge2',
                        nodes: lvl_0_child_1_collected}] 
}

it helped alot!

Topic		Replies	Views
An effort to better understand the underlying mechanism of query filters and profiling Graph Algorithms/Graph Data Science performance , cypher	12	965	July 27, 2020
Graph Data Modeling Question Modeling performance , neo4j-desktop , modeling , data-modeling	12	1242	May 4, 2021
How to fetch millions of data faster? Cypher	30	8500	October 18, 2019
Slowness in execution of Cypher query when we use multiple relationship and inner queries Cypher performance , cypher	13	87	April 10, 2025
Tuning Cypher queries by understanding cardinality Cypher performance , cypher , knowledge-base	0	1001	August 23, 2018

August Summer Fun!