Query taking much longer to run when I filter based on some nodes

Hello,

I'm having trouble with a particular query. I have a graph DB that contains course nodes, skill nodes, and skill_group nodes.

A while ago, I imported the first 100 course nodes, each with a relationship to all 10,000 skill nodes.
Recently, I imported about 2000 more course nodes, each with a relationship to the same 10,000 skill nodes.

The issue is that whenever I use IDs from the new 2000 course nodes in the list parameter $user_courses, the query takes too long: 10+ seconds (and it only runs much quicker on the second or third run).

However, the query runs smoothly when I use IDs from the initial 100 course nodes: about a second or less.

I'm running this query through a Python function for an application. I have also tried putting an index on course.id (and changing ID(c) to c.id in the query), but I'm still seeing the same issue.
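For anyone reproducing this, a course.id index can be created like so (the index name here is just an arbitrary example):

CREATE INDEX course_id_index IF NOT EXISTS
FOR (c:courses) ON (c.id)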

I'm wondering if anyone can help me ?

match (c:courses)-[r]-(s:job_skills)-[]-(d:skill_group)
where ID(c) in $user_courses
with s, d, avg(r.similarity) as average_sim order by average_sim desc limit 100
with d, avg(average_sim) as average_sim order by average_sim desc
return d.name as skill_group_name, ID(d) as skill_group_id

Hi @g.fraser ,

First, a side note: id() is being deprecated in favor of elementId().
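As a rough sketch, the anchoring condition could eventually be rewritten along these lines (note that elementId() returns a string, so the parameter list, here given the assumed name $user_course_element_ids, would need to contain element IDs rather than the numeric IDs used today):

match (c:courses)
where elementId(c) in $user_course_element_ids
return c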

Could you clarify something about the graph size -- have I understood correctly that each of the :courses nodes has a relationship to every :job_skills node, which would look something like this...

To start investigating the slowdown, run the queries in Neo4j Browser, prepending the MATCH clause with the keyword PROFILE to get a view of the execution plan.
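For example, with the query from the post, unchanged apart from the PROFILE keyword:

PROFILE
match (c:courses)-[r]-(s:job_skills)-[]-(d:skill_group)
where ID(c) in $user_courses
with s, d, avg(r.similarity) as average_sim order by average_sim desc limit 100
with d, avg(average_sim) as average_sim order by average_sim desc
return d.name as skill_group_name, ID(d) as skill_group_id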

Thanks,
ABK

Yes, that diagram shows the right structure, but to clarify, the graph DB also contains other nodes and relationships (such as job nodes, user nodes, and job-to-skill relationships) that don't pertain to this query.

I will provide two execution plans: 1) for the query using IDs from the 100 courses, 2) for the query using IDs from the 2000 courses

1) IDs used from 100 courses





2) IDs used from 2000 courses





As you can see from the first image in both cases, the first query takes about 1 sec to run, while the second query takes about 13 secs.

FYI, "job level_3 skills" = "skills_group"

Well, OK. Staring at those execution plans, they are virtually identical. I've consulted with some other folks, and it's possible you've run into a bug, so first I'd like to attempt to reproduce it.

What version of Neo4j is this?

Is it possible to share the dataset?

Best,
ABK

I feel this query is a little 'off'.

You are matching along an entire path from courses to skill_group nodes. You then aggregate over unique combinations of 's' and 'd' nodes, but you are aggregating the 'similarity' property of relationship 'r'. Relationship 'r' does not depend on 'd'; it depends only on 'c' and 's'. As such, the average of r.similarity depends only on 's', not on 's' and 'd'. As a result, your query computes the same average for each 's' over and over again, once for every 'd' related to that 's'.

Your query then orders the rows by that average in order to take the top 100. This will not give you the top 100, since each average is repeated a number of times depending on how many 'd' nodes are related to each 's'.

In the end, I think you want something more like the following.

match (c:courses)-[r]-(s:job_skills)
where ID(c) in $user_courses
with s, avg(r.similarity) as average_sim order by average_sim desc limit 100
match (s)-[]-(d:skill_group)
with d, avg(average_sim) as d_average_sim order by d_average_sim desc
return d.name as skill_group_name, ID(d) as skill_group_id, d_average_sim

Addressing your comments: your second query starts with 2000 course nodes and the original one starts with 100 course nodes, so the second query starts with 20x the number of course nodes. Depending on the relationships through to the 'skill_group' nodes, that could be a huge increase in the number of rows to process. I suspect that if you take a random 100 course nodes from the new set of 2000, you may get query times similar to your original 100 courses.
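A sketch of one way to grab such a random sample directly in Cypher (assuming the new course IDs are supplied in a parameter named $new_course_ids, which is just an example name):

match (c:courses)
where ID(c) in $new_course_ids
with c order by rand() limit 100
return collect(ID(c)) as sample_ids

The resulting list could then be passed back in as $user_courses for a timing comparison.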

Hi @glilienfield ,

Good points about the query semantics! The performance is confusing because looking at the two queries with plans above, both are anchored by only 4 nodes selected by id(). I was expecting the second query plan to show some step that was doing significantly more work, because it was processing much more of the graph. Yet that doesn't seem to be the case.

I'd be tempted to profile the performance of smaller paths to narrow down what part of the query is taking more time.

For example, something simple like:

PROFILE 
MATCH (c:courses)-[r]-(s:job_skills)
WHERE id(c) in [3230, 3279, 3235, 3315]
RETURN c,r,s LIMIT 100

I misread the post, so my comments about the number of nodes do not apply.

I agree with you. The plans look very similar, so they don't explain the disparity in durations.

I agree with you that he should break the query into parts to see where it differs.

I wonder what the performance would be if he mixed IDs from both sets, or took the four 'bad' IDs and substituted one 'good' ID at a time to see how the performance is affected. It has to make a transition from the 12-sec duration to the 1-sec duration at some point during the substitution progression.

First, I would be very hesitant to take the query plan at face value, as it is often wrong when it comes to row estimation.

Second: always specify a relationship type, or you run the risk of having a lot of nodes filtered out purely by node label, which can be very bad!
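For example, assuming the relationship types were called HAS_SKILL and IN_GROUP (the actual names in this graph are not given in the thread), the pattern would become:

match (c:courses)-[r:HAS_SKILL]-(s:job_skills)-[:IN_GROUP]-(d:skill_group)
where ID(c) in $user_courses
return count(*)

With the types specified, the planner can expand only the relevant relationships instead of expanding everything and filtering by node label afterwards.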

Just my 2c! :)

Sorry for the lack of response; I've had quite a busy 10 days.
Providing an update on this issue.

I'm not able to share my graph dataset, but I did run some tests using the sample query you shared earlier:

PROFILE 
MATCH (c:courses)-[r]-(s:job_skills)
WHERE id(c) in [3230, 3279, 3235, 3315]
RETURN c,r,s LIMIT 100

I tested it with:

  1. four IDs selected from the first 100 courses
  2. four IDs selected from the other 2000 courses
  3. two IDs from 1) and two IDs from 2), as @glilienfield suggested

What I found in all three cases was that the profile plans were basically identical (with identical db hits, and runtimes within 10-50 ms of each other).

Afterwards, last week, I needed to make some big changes to the course nodes and most of the similarity edges to the skill nodes, so I opted to delete the course nodes and relationships and re-import them. However, instead of creating 10K edges for each course node, I only created 200 edges between each course node and the skill nodes (based on the top 200 most similar skills for each course). The original code never used all 10K skill edges per course, so it was ultimately unnecessary to keep all of them.
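For anyone in a similar situation who would rather trim an existing graph than re-import, a sketch along these lines should work (assuming the relationship property is r.similarity, as in the queries above, and using the same top-200 cutoff):

match (c:courses)-[r]-(s:job_skills)
with c, r order by r.similarity desc
with c, collect(r) as rels
foreach (rel in rels[200..] | delete rel)

collect() preserves the ordering from the preceding ORDER BY, so the list slice rels[200..] holds everything beyond each course's top 200 edges.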

This change to the graph has helped the queries run better, and selecting course IDs from the other 2000 no longer causes the query to run much slower than selecting from the original 100 courses.

That said, I was never able to determine (or understand) why queries using course IDs from the other 2000 courses ran so much slower, despite all course nodes originally having the same 10K edges to skill nodes.

That was really odd. Glad you figured out a working solution.