Can we improve this further: Optimize Expand Operation

performance
union

(Phanindhar Bodla) #1

WITH [8521,8522,8519] as place_ids
MATCH (place:Location)-[:VISITED_BY]->(user:User)
where place.id in place_ids
WITH collect(distinct user) as users
return users

This query runs for 1.2 s for 3 place_ids places ids can grow up-to 20 with different combinations.
For just a single place its around 350ms and 70 % users are overlapped for places. I understand that we has visit a user multiple times for (place*users) no of relations. But is there way to hint the cypher to mark the visited node or parallelize or some other way to get the time down? It is too scary to imagine runtimes for 20 places. So please suggest anything that may cutdown the runtimes.

Thanks In Advance


(M. David Allen) #2

Please put EXPLAIN in front of your query and post the query plan. The most immediate thing that jumps to mind is whether or not you have an index on :Location(id) and whether the query planner is using it.

The next thing that jumps to mind is the degree of those nodes, how many users have visited them. It could take some time to get a unique list if the total list of users who have visited a place is very large.


(Phanindhar Bodla) #3

Hi @david.allen, I've indexed both the nodes. Please have a look at the plan.

  neo4j> explain
   WITH [8521,8522,8519] as child_ids
   MATCH (lo:LO)<-[:SYLLABUS_LO{type:'in syllabus'}]-(child_syllabus:Syllabus)
   where child_syllabus.id in child_ids
       WITH collect(distinct lo) as los
       return los
   ;
+---------------------------------------------------------------------+
| Plan      | Statement   | Version      | Planner | Runtime   | Time |
+---------------------------------------------------------------------+
| "EXPLAIN" | "READ_ONLY" | "CYPHER 3.4" | "COST"  | "SLOTTED" | 5    |
+---------------------------------------------------------------------+

+------------------------+----------------+-----------------------------------------+-----------------------------------------------+
| Operator               | Estimated Rows | Identifiers                             | Other                                         |
+------------------------+----------------+-----------------------------------------+-----------------------------------------------+
| +ProduceResults        |            154 | los                                     | 3.4; 3.4                                      |
| |                      +----------------+-----------------------------------------+-----------------------------------------------+
| +EagerAggregation      |            154 | los                                     |                                               |
| |                      +----------------+-----------------------------------------+-----------------------------------------------+
| +Filter                |          23667 | anon[49], child_ids, child_syllabus, lo | lo:LO                                         |
| |                      +----------------+-----------------------------------------+-----------------------------------------------+
| +Expand(All)           |         236671 | anon[49], child_ids, child_syllabus, lo | (child_syllabus)-[anon[49]:SYLLABUS_LO]->(lo) |
| |                      +----------------+-----------------------------------------+-----------------------------------------------+
| +Apply                 |          23667 | child_ids, child_syllabus               |                                               |
| |\                     +----------------+-----------------------------------------+-----------------------------------------------+
| | +NodeUniqueIndexSeek |              9 | child_ids, child_syllabus               | :Syllabus(id)                                 |
| |                      +----------------+-----------------------------------------+-----------------------------------------------+
| +Projection            |              1 | child_ids                               | {child_ids : $`  AUTOLIST0`}                  |
+------------------------+----------------+-----------------------------------------+-----------------------------------------------+

Also the I've used alias to provide a better context for community to understand. This is the label mapping Location = Syllabus , User = LO, in the above query. BTW I've used multi line union queries and they were not even close to this in runtime.

Many Thanks,


(Andrew Bowman) #4

Next step is to PROFILE this and provide us with the profile plan so we can see the rows and db hits.

If you can take the visual profile plan from running through the browser, please make sure to expand all elements before saving the image.


(Phanindhar Bodla) #5

PFA


(Andrew Bowman) #6

This looks like an opportunity to change up your model a bit to allow more streamlined expansions.

We can see that your expansion hits more than 77k nodes, and only after filtering (both on the relationship 'type' property and on the :LO label of the connecting nodes) has a high number of db hits and narrows down results to 17k rows.

Node label filtering is relatively cheap, so my guess is that most of the effort is spent on relationship property access to filter on the type property. If you can refactor your graph so you have distinct relationship types rather than using a 'type' property, it's likely we can make this much cheaper, both by avoiding property access, and by avoiding expansion of relationships whose types we don't care about.

So if you use :IN_SYLLABUS relationships instead of :SYLLABUS_LO with a type (and likewise use distinct relationship types in place of the other 'type' values) you're likely to save quite a bit whenever you're doing expansions that should only be on specific types.


(Phanindhar Bodla) #7

Hi @andrew.bowman,

This approach surely does look interesting . Will try out this, and will update the thread.
Thanks for your time. Also it will be great if you can suggest some good books/articles/references on modeling graph for better runtimes.


(Andrew Bowman) #8

You'll probably want to go through Max De Marzi's blog, he's great with optimizing models.

In general though it's a good idea to try to minimize property access, including filtering by property (this is not the same as lookup by property, which can leverage index lookup) unless there's no way around it.

Using distinct relationship types (as in this case) is probably the most useful kind of optimization for traversals (Max uses this quite a bit), since that means we don't even have to expand the types we're not interested in, so we don't even have to perform filtering.


(Phanindhar Bodla) #9

@andrew.bowman Thanks for the great tip. Doing what you suggested helped me to achieve exactly the thing that am looking for, After adding the new label for these relations I've tried to understand the reason, for which, I've profiled and observed that a node is visited exactly once. This is a huge improvement.

Thanks,


(Andrew Bowman) #10

Excellent, glad to help!