Any ideas on speeding up a query with DISTINCT and LIMIT already in place

robikshrestha · September 19, 2019, 2:31am

My graph is defined as follows:

Nodes:

:Object(id, name)
:Attribute(name, concept_type)

Edges: An Object may point to an Attribute through the :HAS_ATTRIBUTE relationship type.

I have about 1M Object nodes and 1000 Attribute nodes. All of the properties are indexed.

I am trying to find objects that share a name but have different attribute names for the same attribute type (e.g., men with different hair colors). The query is shown below. Despite the LIMIT, it runs very slowly.

MATCH (o1:Object)-[:HAS_ATTRIBUTE]->(a1:Attribute),
(o2:Object)-[:HAS_ATTRIBUTE]->(a2:Attribute)
WHERE o1.id > o2.id
AND o1.name = o2.name
AND a1.name > a2.name
AND a1.concept_type = a2.concept_type
AND a1.concept_type = 'color'
RETURN DISTINCT o1.name,a1.name,a2.name, a1.concept_type
LIMIT 1000

Could you provide tips on speeding this up? Would I be better off making attributes properties of Object instead of having edges from Object to Attribute? My only concern is that the same object can have multiple values for the same attribute type.
Neo4j version: 3.5.6

andrew_bowman · September 19, 2019, 3:08am

Can you provide the PROFILE plan of the query, and make sure you expand all elements of the plan (using the double-down arrow in the lower right corner of the result pane) before you export it?

michael.hunger · September 19, 2019, 4:43am

I guess you have a problem of "overindexing" here.

Adding a USING INDEX o2:Object(name) hint might help, and possibly USING INDEX a1:Attribute(concept_type) as well:

MATCH (o1:Object)-[:HAS_ATTRIBUTE]->(a1:Attribute)
WHERE a1.concept_type = 'color'
MATCH (o2:Object)-[:HAS_ATTRIBUTE]->(a2:Attribute)
USING INDEX o2:Object(name)
WHERE o1.name = o2.name
AND o1.id > o2.id
AND a1.name > a2.name
AND a1.concept_type = a2.concept_type
RETURN DISTINCT o1.name,a1.name,a2.name, a1.concept_type
LIMIT 1000

robikshrestha · September 19, 2019, 2:12pm

@andrew_bowman
I added a filter, o1.name = 'wall', and reduced the LIMIT to 100, because the previous query was taking a very long time. Here is the PROFILE for this query:

PROFILE MATCH (o1:Object)-[:HAS_ATTRIBUTE]->(a1:Attribute),
(o2:Object)-[:HAS_ATTRIBUTE]->(a2:Attribute)
WHERE o1.id > o2.id
AND o1.name = o2.name
AND a1.name > a2.name
AND a1.concept_type = a2.concept_type
AND o1.name = 'wall'
AND a1.concept_type = 'color'
RETURN DISTINCT o1.name,a1.name,a2.name, a1.concept_type
LIMIT 100

robikshrestha · September 19, 2019, 2:18pm

@michael.hunger
I got an error saying the database failed to use the hints of the query:

Could not solve these hints: `USING INDEX o2:Object(name)`

Plan Limit(SignedDecimalIntegerLiteral(1000), DoNotIncludeTies) {
  LHS -> Distinct(Map(o1.name -> Property(Variable(o1),PropertyKeyName(name)), a1.name -> Property(Variable(a1),PropertyKeyName(name)), a2.name -> Property(Variable(a2),PropertyKeyName(name)), a1.concept_type -> Property(Variable(a1),PropertyKeyName(concept_type)))) {
    LHS -> Selection(Ands(Set(Equals(Property(Variable(a1),PropertyKeyName(concept_type)),Property(Variable(a2),PropertyKeyName(concept_type))), Equals(Property(Variable(o1),PropertyKeyName(name)),Property(Variable(o2),PropertyKeyName(name))), AndedPropertyInequalities(Variable(a1),Property(Variable(a1),PropertyKeyName(name)),GreaterThan(Property(Variable(a1),PropertyKeyName(name)),Property(Variable(a2),PropertyKeyName(name))))))) {
      LHS -> Apply() {
        LHS -> Selection(Ands(Set(HasLabels(Variable(o2),List(LabelName(Object)))))) {
          LHS -> Expand(a2, INCOMING, List(RelTypeName(HAS_ATTRIBUTE)), o2,   UNNAMED101, ExpandAll) {
            LHS -> NodeByLabelScan(a2, LabelName(Attribute), Set()) {}
          }
        }
        RHS -> Selection(Ands(Set(In(Property(Variable(a1),PropertyKeyName(concept_type)),ListLiteral(List(Parameter(  AUTOSTRING0,String)))), HasLabels(Variable(a1),List(LabelName(Attribute)))))) {
          LHS -> Expand(o1, OUTGOING, List(RelTypeName(HAS_ATTRIBUTE)), a1,   UNNAMED18, ExpandAll) {
            LHS -> NodeUniqueIndexSeek(o1, LabelToken(Object,LabelId(3)), List(IndexedProperty(PropertyKeyToken(id,PropertyKeyId(8)),DoNotGetValue)), RangeQueryExpression(InequalitySeekRangeWrapper(RangeGreaterThan(ExclusiveBound(Property(Variable(o2),PropertyKeyName(id)))))), Set(o2, a2,   UNNAMED101), IndexOrderNone) {}
          }
        }
      }
    }
  }
}

I have re-checked my indices and Object.name is indexed.

andrew_bowman · September 19, 2019, 3:39pm

Can you provide the version of Neo4j you're using? I recall we had a bug like that patched up a little while back.

robikshrestha · September 19, 2019, 4:44pm

The neo4j version is 3.5.6.

michael.hunger · September 19, 2019, 6:38pm

Can you raise a GitHub issue for this? That would be really helpful.

andrew_bowman · September 19, 2019, 6:48pm

That does look buggy.

You can force it to work by introducing a variable in a WITH clause between the matches, which will enforce the ordering of the matches (I suspect the planner is reordering to attempt to execute the second match first, which won't work):

explain
MATCH (o1:Object)-[:HAS_ATTRIBUTE]->(a1:Attribute)
WHERE a1.concept_type = 'color'
WITH o1, a1, 1 as ignored
MATCH (o2:Object)-[:HAS_ATTRIBUTE]->(a2:Attribute)
USING INDEX o2:Object(name)
WHERE o2.name = o1.name
AND o1.id > o2.id
AND a1.name > a2.name
AND a1.concept_type = a2.concept_type
RETURN DISTINCT o1.name,a1.name,a2.name, a1.concept_type
LIMIT 1000

robikshrestha · September 19, 2019, 7:19pm

I have filed a bug report here: "Could not solve these hints" error with USING INDEX (Issue #12296, neo4j/neo4j on GitHub). Please let me know if you need additional info.

robikshrestha · September 19, 2019, 7:29pm

Hi,

This is pure wizardry! The statement you provided executed in 2 seconds, whereas my original MATCH query had not completed after about 5 minutes.

I am providing query plans for both:

Slow query:

MATCH (o1:Object)-[:HAS_ATTRIBUTE]->(a1:Attribute),
(o2:Object)-[:HAS_ATTRIBUTE]->(a2:Attribute)
WHERE o1.id > o2.id
AND o1.name = o2.name
AND a1.name > a2.name
AND a1.concept_type = a2.concept_type
AND a1.concept_type = 'color'
RETURN DISTINCT o1.name,a1.name,a2.name, a1.concept_type
LIMIT 1000

Fast Query:

MATCH (o1:Object)-[:HAS_ATTRIBUTE]->(a1:Attribute)
WHERE a1.concept_type = 'color'
WITH o1, a1, 1 as ignored
MATCH (o2:Object)-[:HAS_ATTRIBUTE]->(a2:Attribute)
USING INDEX o2:Object(name)
WHERE o2.name = o1.name
AND o1.id > o2.id
AND a1.name > a2.name
AND a1.concept_type = a2.concept_type
RETURN DISTINCT o1.name,a1.name,a2.name, a1.concept_type
LIMIT 1000

robikshrestha · September 19, 2019, 8:36pm

Why is "1 as ignored" necessary here?

andrew_bowman · September 19, 2019, 8:50pm

For consecutive MATCH clauses connected by simple WITH clauses, the planner has the option of reordering them and choosing where to start and how to expand the resulting pattern(s). I believe the problem in this query is that we're providing a planner hint whose predicate relies on a value from the previous match, so that first match has to execute first. But the hint itself is for finding a starting place in the graph. The planner should be able to figure out the right order of execution, but it's tripping on something, which is why this looks like buggy behavior.

When we introduce a new variable in a WITH clause, it forces the ordering. The first MATCH happens first, then the new variable is introduced into scope, then the second MATCH is executed using the index hint. There is no chance of reordering, and the index hint can only be used after the first match has executed, which ensures we have the value of o1.name needed to fulfill the hint.
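
To illustrate why looking up o2 by name for each o1 row is so much cheaper than comparing every pair, here is a minimal Python sketch. It is only a toy model of the two join shapes, not of Neo4j's planner or of the exact plan it chose; the sample rows and the function names cross_product_join and index_lookup_join are made up for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample data: (object id, object name, colour attribute name).
ROWS = [
    (1, "wall", "white"),
    (2, "wall", "grey"),
    (3, "wall", "white"),
    (4, "door", "brown"),
    (5, "door", "red"),
    (6, "sky", "blue"),
]


def cross_product_join(rows):
    """Compare every row with every row, then filter (pairwise shape)."""
    comparisons = 0
    result = set()
    for (id1, name1, col1), (id2, name2, col2) in product(rows, repeat=2):
        comparisons += 1
        if id1 > id2 and name1 == name2 and col1 > col2:
            result.add((name1, col1, col2))
    return result, comparisons


def index_lookup_join(rows):
    """For each left row, fetch candidate right rows from a name index."""
    by_name = defaultdict(list)  # stands in for the index on Object(name)
    for row in rows:
        by_name[row[1]].append(row)

    comparisons = 0
    result = set()
    for id1, name1, col1 in rows:
        # Only rows with the same name are ever touched.
        for id2, _name2, col2 in by_name[name1]:
            comparisons += 1
            if id1 > id2 and col1 > col2:
                result.add((name1, col1, col2))
    return result, comparisons


slow_result, slow_comparisons = cross_product_join(ROWS)
fast_result, fast_comparisons = index_lookup_join(ROWS)
print(slow_result == fast_result, slow_comparisons, fast_comparisons)
```

Both functions return the same set of (name, colour, colour) triples, but the pairwise version grows with the square of the total row count, while the lookup version only grows with the size of each same-name group. That gap is roughly what the index hint buys once the first MATCH is guaranteed to run first.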

robikshrestha · September 20, 2019, 1:04am

Thank you. That makes sense now. I used it in other "inner join" kinda queries too, and it made things much faster because the index was being used.

Of course, this particular trick did require knowledge of how query planner works. Any chance you have a list of such tips and tricks?
