Questions about my query model

Hi folks, I am building a Graph API where I can get nodes and edges subsequently in the same request.

I am having a bit of trouble trying to optimize this query (please ignore literal values as I replace them with parameters on the application). I am getting out of ideas on how to make them better and use less memory (main issue).

Basically my query get lists of a specific type o node from another one, then orders by insertion_time and then I use subqueries to get some more inner nodes. We have a lot of types of queries, but I will examplify with one of them.

Basically I am using a combination of COMBINE and DISTINCT and then limiting the nodes so I can prepare them for the following query, because it seems better to filter them before going to the sub queries.

Here is the query:

	MATCH (root_node:Entity:Business {source_id:"0101-0003-0006-00000001"})-[lvl_0_main_relationship_can_read:CAN_READ]-> (lvl_0:Entity:Membership)
		  WITH lvl_0,
		  root_node
	ORDER BY lvl_0.insertion_time DESC
	WHERE (lvl_0.status IN [ "ACTIVE",
					  "DISABLED",
					  "PENDING" ] )
		  AND (lvl_0.insertion_time <= "2025-02-27T03:37:12.1748779Z")
		  WITH COLLECT(DISTINCT(lvl_0))[0..20] as lvl_0_collected,
		  root_node
	UNWIND lvl_0_collected AS lvl_0
	CALL {
		  WITH lvl_0,
				root_node

		  OPTIONAL MATCH (root_node)-[:CAN_READ]->(lvl_0)-[lvl_0_child_0_main_relationship_can_read:CAN_READ]-> (lvl_0_child_0:Entity:AssetGroup)
		  WITH lvl_0_child_0,
				lvl_0,
				root_node
		  ORDER BY lvl_0_child_0.insertion_time desc
		  WHERE (lvl_0_child_0.status IN [ "ACTIVE" ] )
		  WITH collect(distinct(lvl_0_child_0))[0..20] as lvl_0_child_0_collected,
				lvl_0,
				root_node
		  UNWIND lvl_0_child_0_collected as lvl_0_child_0
		  CALL {
		  WITH lvl_0_child_0

				OPTIONAL MATCH (lvl_0_child_0)-[:BELONGS_TO]-> (lvl_0_child_0_child_0:Tenant)
		  WITH lvl_0_child_0_child_0,
					  lvl_0_child_0
		  WITH collect(distinct(lvl_0_child_0_child_0))[0..1] as lvl_0_child_0_child_0_collected,
					  lvl_0_child_0
				UNWIND lvl_0_child_0_child_0_collected as lvl_0_child_0_child_0
				RETURN COLLECT(DISTINCT({node: lvl_0_child_0_child_0,
										children:[]})) as lvl_0_child_0_child_0_collected }
		  RETURN COLLECT(DISTINCT({node: lvl_0_child_0,
								  children:[{edge: "tenants",
											  nodes: lvl_0_child_0_child_0_collected}]})) as lvl_0_child_0_collected 
	}
	CALL {
		  WITH lvl_0

		  OPTIONAL MATCH (lvl_0)-[:BELONGS_TO]-> (lvl_0_child_1:Tenant)
		  WITH DISTINCT lvl_0_child_1
		  LIMIT 1
		  RETURN COLLECT({node: lvl_0_child_1,
							children:[]}) as lvl_0_child_1_collected 
	}
	RETURN COLLECT({node: lvl_0,
					  children: [{edge: 'edge1',
								  nodes: lvl_0_child_0_collected},
							{edge: 'tenants',
								  nodes: lvl_0_child_1_collected}] })

What I would like some help is to understand if the issue is the "COLLECT/DISTINCT" that I am using, and if there is some better way to sort the results, filter and then run the subquery. This pattern seems to do the least dbhits I could find, but I think that it is not the most memory-optimized.

My db has a pattern of indexes for {Label} on source_id and for Entity on insertion_time.

But I think the main issue is the too long profile that it generates. I Would appreciate some ideas to make it more optimal, maintaining the response. I am running out of ideias, so I would really appreciate any insight.

If there is any missing info, please let me know

Explain:

Hi Victor,

At first glance, you are creating a query, then using the output converted as a list, and passed as a parameter to the next query, this parameter goes back into the following query (rather than being a continuation of the query), so that is probably why your EXPLAIN looks so long.

Why not start with consolidating the queries?

MATCH (root_node:Entity:Business { source_id:"0101-0003-0006-00000001" })-[:CAN_READ]->(lvl_0:Entity:Membership)-[:CAN_READ*0..1]->(lvl_0_child_0:Entity:AssetGroup)
WITH lvl_0, root_node
  WHERE (lvl_0.status IN [ "ACTIVE", "DISABLED", "PENDING" ] )
  AND (lvl_0_child_0.status IN ["ACTIVE"])
  AND (lvl_0.insertion_time <= "2025-02-27T03:37:12.1748779Z")

You can also probably leave the "order by" to the end (I use epoch's - long - and do the sort in the consumer of the query, but that works in my case as the results are usually a handful)

Also, might it be useful if you show a graph?

Hi Josh! Thank you so much for your time trying to help me.

I try to make this chained queries so that I can get the chained response, here is an example os the query response:

I may want to make the graph call:

https://{host}/{version}/{business_id}/memberships?fields=asset_groups?limit=4

The response looks very similar to Meta's Graph api, which is where I took my inspiration from.

This would return the following graph:

you can see that I returned 4 "Memberships" Nodes and its respectived connected "AssetGroup" nodes.

Here is an example json: JSON Editor Online: edit JSON, format JSON, query JSON

What do you mean in your example of "consolidating the queries"? You mean making multiples cypher calls instead of a single one?

It is interesting doing the order by in the consumer, but I think in my case would waste too much processing and get a lot of unsused data. I was trying to get the complete chained nodes on neo4j so that I could continue my application's request.

Also. I do the order by and filter to the list before going to the subquery so that I can only process the nodes that are going to return. Let's say there are 100 Membership nodes in the specified Business Node, I would only get 4 of them, so that I can get subnodes only from each of them, it was the way I found to saved up dbhits. Please, let me know if this strategy makes sense. But I understand that the order by is one of the main reasons why it takes up so much processing, but I was not able to change it yet, because as I said, my application returns the nodes always in this order (most recent).

I mean doing less MATCH statements and more refinement parts.

but you already mentioned 20M nodes? so you are getting unused data anyway - depends on how many clients you have, and the rate of change of your nodes?

Yes - that makes sense, but you seem to have a few order by hence why I recommended doing it once in the client

And re:GraphQL, here is something I sent to someone else on a different thread somewhere else (courtesy of ChatGPT):

GraphQL offers flexibility and efficiency in API design, but it comes with scalability challenges, especially as applications grow. Some key scalability problems include:

1. N+1 Query Problem

  • When fetching nested data, GraphQL might execute multiple small queries instead of an optimized batch request.
  • Example: Fetching a list of users and their posts might trigger separate database queries for each user, leading to excessive database hits.

Solution: Use data loaders or batch queries to reduce redundant requests.

2. Complex Queries Leading to Performance Issues

  • Clients can construct deeply nested or overly broad queries, resulting in slow database operations.
  • Unlike REST, where endpoints have fixed responses, GraphQL’s flexibility allows clients to request large amounts of data in a single query.

Solution: Implement query complexity analysis or depth limiting to prevent inefficient queries.

3. Caching Difficulties

  • REST APIs benefit from HTTP caching at the CDN level, but GraphQL queries are typically sent via POST requests, making caching harder.
  • Dynamic query structures mean responses are less predictable, reducing cache reusability.

Solution: Use persistent queries and response caching strategies at the resolver or database layer.

4. Authorization & Security Risks

  • More complex permission handling due to fine-grained data access.
  • Malicious users can craft expensive queries to overload the backend (e.g., requesting deeply nested data).

Solution: Implement query cost analysis, rate limiting, and field-level authorization.

5. Schema Complexity & Maintainability

  • Large applications with many types and resolvers can lead to difficult-to-maintain schemas.
  • Versioning can be tricky since GraphQL avoids traditional REST versioning practices.

Solution: Use schema stitching or federation for modular design, and deprecate fields instead of versioning.

6. Scaling GraphQL Servers

  • Since GraphQL servers handle complex query parsing and execution, they require more CPU/memory than simpler REST endpoints.
  • High-traffic applications need efficient resource management to prevent bottlenecks.

Solution: Use efficient resolver patterns, load balancing, and consider serverless GraphQL approaches.

Would you like specific recommendations for scaling GraphQL in your use case?

Here in the example, I using a test database, but in production we have over 10M nodes. We have a considerated amount of consumers already, and they make these type os requests constantly, some of them are a bit more long, but it always looks like this example. The rate of change of the nodes happen constantly, usually the creation of more nodes. I have limited each subtree to have a maximum os nodes, so a single node would not have an infinite amount of sub-nodes of the same type connected to them.

Well, I guess I am, but the filtering I mentioned doing avoids from getting unrequired sub nodes from nodes that would not be in the response, but it does not seem to be a good trade, as I am looking at the metrics. The CPU is really fine, the main issue is the memory, this is the optimization I am look for


Please forgive me if I did not understant correctly, but what you are trying to recommend is the following:

FIrst, the the first-level nodes:

MATCH (root_node:Entity:Business {source_id:"0101-0003-0006-00000001"})-[lvl_0_main_relationship_can_read:CAN_READ]-> (lvl_0:Entity:Membership)
WHERE (lvl_0.status IN [ "ACTIVE",
				  "DISABLED",
				  "PENDING" ] )
	  AND (lvl_0.insertion_time <= 12123131231231
	  return lvl_0 order by lvl_0.insertion_time_unix_time

And Then, only after getting the response, do the subsequent calls separately on other cypher requests for the returned nodes, something like:

MATCH (root_node:Entity:Membership)-[:CAN_READ]-> (lvl_0:Entity:AssetGroup)
WHERE root_node.source_id in ["1","2","3"] AND (lvl_0.status IN [ "ACTIVE"] )
	  AND (lvl_0.insertion_time_unix_time<= 12123131231231
	  return lvl_0 order by lvl_0.insertion_time_unix_time

That could very well be a possibility. I was considering it, if I could not return the chained request in a optimal way, but I was considering if there would be a tweak that I could make to the request so that it would shorten the EXPLAIN. the main issue seems to be the order by indeed.

I understand that graphql really gives this kind of problem, but I am trying to optimize in a single query first. Maybe it is better to have more dbhits and consume less query memory than the other way around? I don't know, I am thinking about that now.

In this message, I consolidated 2 of your original MATCH into one, you could also aim at doing the sort here once (as it will be in the filtered output of 2 subsets):

Well indeed, sorry about that. And that's really a great ideia.

The "Membership" may not have connected "AssetGroup", so would it make sense to do something likes this then?

MATCH (root_node:Entity:Business { source_id:"0101-0003-0006-00000001" })-[:CAN_READ]->(lvl_0:Entity:Membership)
OPTIONAL MATCH (lvl_0)-[:CAN_READ]->(lvl_0_child_0:Entity:AssetGroup)
WITH lvl_0, root_node, lvl_0_child_0
  WHERE (lvl_0.status IN [ "ACTIVE", "DISABLED", "PENDING" ] )
  AND (lvl_0_child_0.status IN ["ACTIVE"])
  AND (lvl_0.insertion_time <= "2025-02-27T03:37:12.1748779Z")
 return {node: lvl_0,children: lvl_0_child_0} limit 20

Only because if I use the single MATCH, it would only return Memberships that have AssetGroups, it seems

Also, it seems that if the Membership has 2 AssetGroup, it is returning as 2 items in the response, instead of 1, with 2 "children" nodes. I am trying to re-build this query to see if there is a way to consolidate it in one node response. Also I forgot on the query to limit the subnode. I will share it here if I am able to build the correct query.

Don't know if this is too much already, but i would look something likes?

PROFILE MATCH (root_node:Entity:Business { source_id:"0101-0003-0006-00000001" })-[:CAN_READ]->(lvl_0:Entity:Membership)
OPTIONAL MATCH (lvl_0)-[:CAN_READ]->(lvl_0_child_0:Entity:AssetGroup)
WITH lvl_0, root_node, lvl_0_child_0
  WHERE (lvl_0.status IN [ "ACTIVE", "DISABLED", "PENDING" ] )
  AND (lvl_0.insertion_time <= "2025-02-27T03:37:12.1748779Z")
  AND (lvl_0_child_0.status IN ["ACTIVE"])
  WITH lvl_0_child_0, lvl_0, root_node
  order by lvl_0_child_0.insertion_time desc limit 20

 return {node: lvl_0,children: lvl_0_child_0} order by lvl_0.insertion_time limit 20

It was the way I found to return the same response, really downsizing the EXPLAIN.

I used the OPTIONAL match because I want event the Memberships that do not have AssetGroups

and I sorted and limited the nested AssetGroup for each Membership, so that i could return them already consolidated.

The problem is that it is limiting the memberships as well, it is not 100% correct yet

That's why you can have *0..1, did you try it ?

I don't have your graph, so you might have to go trial-and-error:

PROFILE 
MATCH (root_node:Business { source_id:"0101-0003-0006-00000001" })
  -[:CAN_READ*0..1]->(lvl_0:Membership) 
  -[:CAN_READ*0..1]->(lvl_0_child_0:AssetGroup)
WHERE (lvl_0.status IN [ "ACTIVE", "DISABLED", "PENDING" ] )
  AND (lvl_0.insertion_time <= "2025-02-27T03:37:12.1748779Z")
  AND (lvl_0_child_0.status IN ["ACTIVE"])
  ORDER BY lvl_0.insertion_time, lvl_0_child_0.insertion_time
RETURN {node: lvl_0, children: lvl_0_child_0} limit 20
1 Like

Hi Josh! I've tried with *0..1, but did not work.

But I am trying to make it work so I can update here with a correct query.

Thanks for the feedback and the ideias, they were precious!

I will post here soon a working query, if I try to make it work

Thank you very much!

1 Like

Hi Josh!

I have optimized the query but I maintained the structure. Here it is:

PROFILE

MATCH (root_node:Entity:Business {source_id:"0101-0003-0006-00000001"})-[lvl_0_main_relationship_can_read:CAN_READ]-> (lvl_0:Entity:Membership)
WHERE (lvl_0.status IN [ "ACTIVE",
                  "DISABLED",
                  "PENDING" ] )
      AND (lvl_0.insertion_time <= $date)
      WITH DISTINCT lvl_0,
      root_node
ORDER BY lvl_0.insertion_unix_time DESC SKIP 0
LIMIT 20
CALL {
      WITH lvl_0,
            root_node

      OPTIONAL MATCH (root_node)-[:CAN_READ]->(lvl_0)-[lvl_0_child_0_main_relationship_can_read:CAN_READ]-> (lvl_0_child_0:Entity:AssetGroup)
      WHERE (lvl_0_child_0.status IN [ "ACTIVE" ] )
      WITH DISTINCT lvl_0_child_0,
            lvl_0,
            root_node
      ORDER BY lvl_0_child_0.insertion_unix_time desc
      LIMIT 20
      CALL {
      WITH lvl_0_child_0,
                  lvl_0

            OPTIONAL MATCH (lvl_0)-[:CAN_READ]->(lvl_0_child_0)-[lvl_0_child_0_child_0_main_relationship_can_read:CAN_READ]-> (lvl_0_child_0_child_0:Entity:FacebookAdAccount)
            WHERE (lvl_0_child_0_child_0.status IN [ "ACTIVE",
                              "DISABLED",
                              "UNVERIFIED" ] )
      AND lvl_0_child_0_child_0.name
            CONTAINS "27"
      WITH DISTINCT lvl_0_child_0_child_0,
                  lvl_0_child_0,
                  lvl_0
            ORDER BY lvl_0_child_0_child_0.insertion_unix_time desc
            LIMIT 20
            CALL {
      WITH lvl_0_child_0_child_0

                  OPTIONAL MATCH (lvl_0_child_0_child_0)-[:BELONGS_TO]-> (lvl_0_child_0_child_0_child_0:Tenant)
      WITH lvl_0_child_0_child_0_child_0,
                        lvl_0_child_0_child_0
                  LIMIT 1
                  RETURN COLLECT({node: lvl_0_child_0_child_0_child_0,
                                    children:[]}) as lvl_0_child_0_child_0_child_0_collected }
            RETURN COLLECT({node: lvl_0_child_0_child_0,
                              children:[{edge: "tenants",
                                          nodes: lvl_0_child_0_child_0_child_0_collected}]}) as lvl_0_child_0_child_0_collected }
      CALL {
      WITH lvl_0_child_0

            OPTIONAL MATCH (lvl_0_child_0)-[:BELONGS_TO]-> (lvl_0_child_0_child_1:Tenant)
      WITH lvl_0_child_0_child_1,
                  lvl_0_child_0
            LIMIT 1
            RETURN COLLECT({node: lvl_0_child_0_child_1,
                              children:[]}) as lvl_0_child_0_child_1_collected }
      RETURN COLLECT({node: lvl_0_child_0,
                        children:[{edge: "facebook_ad_accounts",
                                    nodes: lvl_0_child_0_child_0_collected},
                              {edge: "tenants",
                                    nodes: lvl_0_child_0_child_1_collected}]}) as lvl_0_child_0_collected 
}
CALL {
      WITH lvl_0

      OPTIONAL MATCH (lvl_0)-[:BELONGS_TO]-> (lvl_0_child_1:Tenant)
      WITH lvl_0_child_1,
            lvl_0
      LIMIT 1
      RETURN COLLECT({node: lvl_0_child_1,
                        children:[]}) as lvl_0_child_1_collected 
}
RETURN {
            node: lvl_0,
            children: [{edge: 'edge1',
                        nodes: lvl_0_child_0_collected},
                  {edge: 'edge2',
                        nodes: lvl_0_child_1_collected}] 
}

Unfortunately I could not create the same response format using a single query.

But removing the "collect(distinct(...))" from the query helped decrease the memory. Now the nodes are limited before going to the subquery, and it only processes the filtered nodes.

I still am not happy with this, I really wanted to make it work like your suggestion, but I failed so far.

Anyway, just sharing the results

EDIT:

I also removed the DISTINCT and OPTIONAL MATCH where is safe to do it:

PROFILE

MATCH (root_node:Entity:Business {source_id:"0101-0003-0006-00000001"})-[lvl_0_main_relationship_can_read:CAN_READ]-> (lvl_0:Entity:Membership)
WHERE (lvl_0.status IN [ "ACTIVE",
                  "DISABLED",
                  "PENDING" ] )
      AND (lvl_0.insertion_time <= $date)
      WITH DISTINCT lvl_0,
      root_node
ORDER BY lvl_0.insertion_unix_time DESC SKIP 0
LIMIT 20
CALL {
      WITH lvl_0,
            root_node

      MATCH (root_node)-[:CAN_READ]->(lvl_0)-[lvl_0_child_0_main_relationship_can_read:CAN_READ]-> (lvl_0_child_0:Entity:AssetGroup)
      WHERE (lvl_0_child_0.status IN [ "ACTIVE" ] )
      WITH lvl_0_child_0,
            lvl_0,
            root_node
      ORDER BY lvl_0_child_0.insertion_unix_time desc
      LIMIT 20
      CALL {
      WITH lvl_0_child_0,
                  lvl_0

            MATCH (lvl_0)-[:CAN_READ]->(lvl_0_child_0)-[lvl_0_child_0_child_0_main_relationship_can_read:CAN_READ]-> (lvl_0_child_0_child_0:Entity:FacebookAdAccount)
            WHERE (lvl_0_child_0_child_0.status IN [ "ACTIVE",
                              "DISABLED",
                              "UNVERIFIED" ] )
      AND lvl_0_child_0_child_0.name
            CONTAINS "27"
      WITH lvl_0_child_0_child_0,
                  lvl_0_child_0,
                  lvl_0
            ORDER BY lvl_0_child_0_child_0.insertion_unix_time desc
            LIMIT 20
            CALL {
      WITH lvl_0_child_0_child_0

                  MATCH (lvl_0_child_0_child_0)-[:BELONGS_TO]-> (lvl_0_child_0_child_0_child_0:Tenant)
      WITH lvl_0_child_0_child_0_child_0,
                        lvl_0_child_0_child_0
                  LIMIT 1
                  RETURN COLLECT({node: lvl_0_child_0_child_0_child_0,
                                    children:[]}) as lvl_0_child_0_child_0_child_0_collected }
            RETURN COLLECT({node: lvl_0_child_0_child_0,
                              children:[{edge: "tenants",
                                          nodes: lvl_0_child_0_child_0_child_0_collected}]}) as lvl_0_child_0_child_0_collected }
      CALL {
      WITH lvl_0_child_0

            MATCH (lvl_0_child_0)-[:BELONGS_TO]-> (lvl_0_child_0_child_1:Tenant)
      WITH lvl_0_child_0_child_1,
                  lvl_0_child_0
            LIMIT 1
            RETURN COLLECT({node: lvl_0_child_0_child_1,
                              children:[]}) as lvl_0_child_0_child_1_collected }
      RETURN COLLECT({node: lvl_0_child_0,
                        children:[{edge: "facebook_ad_accounts",
                                    nodes: lvl_0_child_0_child_0_collected},
                              {edge: "tenants",
                                    nodes: lvl_0_child_0_child_1_collected}]}) as lvl_0_child_0_collected 
}
CALL {
      WITH lvl_0

      MATCH (lvl_0)-[:BELONGS_TO]-> (lvl_0_child_1:Tenant)
      WITH lvl_0_child_1,
            lvl_0
      LIMIT 1
      RETURN COLLECT({node: lvl_0_child_1,
                        children:[]}) as lvl_0_child_1_collected 
}
RETURN {
            node: lvl_0,
            children: [{edge: 'edge1',
                        nodes: lvl_0_child_0_collected},
                  {edge: 'edge2',
                        nodes: lvl_0_child_1_collected}] 
}

it helped alot!