Any ideas on speeding up a query with DISTINCT and LIMIT already in place

My graph is defined as follows:

Nodes:

:Object(id, name)
:Attribute(name, concept_type)

Edges: Objects may point to Attribute using :HAS_ATTRIBUTE relation type.

I have about 1M Object nodes and 1000 Attribute nodes. All of the properties are indexed.

I am trying to get objects whose names are same but have different attribute names for same attribute type (e.g., men with different colored hairs). The query is shown below. Despite placing a limit on the following query, it is running very slow.

MATCH (o1:Object)-[:HAS_ATTRIBUTE]->(a1:Attribute),
(o2:Object)-[:HAS_ATTRIBUTE]->(a2:Attribute)
WHERE o1.id > o2.id
AND o1.name = o2.name
AND a1.name > a2.name
AND a1.concept_type = a2.concept_type
AND a1.concept_type = 'color'
RETURN DISTINCT o1.name,a1.name,a2.name, a1.concept_type
LIMIT 1000

Could you provide tips on speeding this up. Would I be better off making attributes properties of Object instead of having edges from Object to Attribute? My only concern is same object can have multiple values for same attribute type.
Neo4j version = 3.5.6

Can you provide the PROFILE plan of the query, and make sure you expand all elements of the plan (using the double-down arrow in the lower right corner of the result pane) before you export it?

I guess here you have a problem of "overindexing"

Using an USING INDEX o2:Object(name) might help.
and possibly: USING INDEX a1:Attribute(concept_type)

MATCH (o1:Object)-[:HAS_ATTRIBUTE]->(a1:Attribute)
WHERE a1.concept_type = 'color'
MATCH (o2:Object)-[:HAS_ATTRIBUTE]->(a2:Attribute)
USING INDEX o2:Object(name)
WHERE o1.name = o2.name
AND o1.id > o2.id
AND a1.name > a2.name
AND a1.concept_type = a2.concept_type
RETURN DISTINCT o1.name,a1.name,a2.name, a1.concept_type
LIMIT 1000

@andrew.bowman
I placed an additional filter o1.name='wall' and reduced the LIMIT to 100, because previous query was taking a very long time. Here is the PROFILE for this query:

PROFILE MATCH (o1:Object)-[:HAS_ATTRIBUTE]->(a1:Attribute),
(o2:Object)-[:HAS_ATTRIBUTE]->(a2:Attribute)
WHERE o1.id > o2.id
AND o1.name = o2.name
AND a1.name > a2.name
AND a1.concept_type = a2.concept_type
AND o1.name = 'wall'
AND a1.concept_type = 'color'
RETURN DISTINCT o1.name,a1.name,a2.name, a1.concept_type
LIMIT 100

@michael.hunger
I got an error saying the database failed to use the hints of the query:

Could not solve these hints: `USING INDEX o2:Object(name)`

Plan Limit(SignedDecimalIntegerLiteral(1000), DoNotIncludeTies) {
  LHS -> Distinct(Map(o1.name -> Property(Variable(o1),PropertyKeyName(name)), a1.name -> Property(Variable(a1),PropertyKeyName(name)), a2.name -> Property(Variable(a2),PropertyKeyName(name)), a1.concept_type -> Property(Variable(a1),PropertyKeyName(concept_type)))) {
    LHS -> Selection(Ands(Set(Equals(Property(Variable(a1),PropertyKeyName(concept_type)),Property(Variable(a2),PropertyKeyName(concept_type))), Equals(Property(Variable(o1),PropertyKeyName(name)),Property(Variable(o2),PropertyKeyName(name))), AndedPropertyInequalities(Variable(a1),Property(Variable(a1),PropertyKeyName(name)),GreaterThan(Property(Variable(a1),PropertyKeyName(name)),Property(Variable(a2),PropertyKeyName(name))))))) {
      LHS -> Apply() {
        LHS -> Selection(Ands(Set(HasLabels(Variable(o2),List(LabelName(Object)))))) {
          LHS -> Expand(a2, INCOMING, List(RelTypeName(HAS_ATTRIBUTE)), o2,   UNNAMED101, ExpandAll) {
            LHS -> NodeByLabelScan(a2, LabelName(Attribute), Set()) {}
          }
        }
        RHS -> Selection(Ands(Set(In(Property(Variable(a1),PropertyKeyName(concept_type)),ListLiteral(List(Parameter(  AUTOSTRING0,String)))), HasLabels(Variable(a1),List(LabelName(Attribute)))))) {
          LHS -> Expand(o1, OUTGOING, List(RelTypeName(HAS_ATTRIBUTE)), a1,   UNNAMED18, ExpandAll) {
            LHS -> NodeUniqueIndexSeek(o1, LabelToken(Object,LabelId(3)), List(IndexedProperty(PropertyKeyToken(id,PropertyKeyId(8)),DoNotGetValue)), RangeQueryExpression(InequalitySeekRangeWrapper(RangeGreaterThan(ExclusiveBound(Property(Variable(o2),PropertyKeyName(id)))))), Set(o2, a2,   UNNAMED101), IndexOrderNone) {}
          }
        }
      }
    }
  }
}

I have re-checked my indices and Object.name is indexed.

Can you provide the version of Neo4j you're using? I recall we had a bug like that patched up a little while back.

The neo4j version is 3.5.6.

can you raise an GitHub issue with this? that would be really helpful.

That does look buggy.

You can force it to work by introducing a variable in a WITH clause between the matches, which will enforce the ordering of the matches (I suspect the planner is reordering to attempt to execute the second match first, which won't work):

explain
MATCH (o1:Object)-[:HAS_ATTRIBUTE]->(a1:Attribute)
WHERE a1.concept_type = 'color'
WITH o1, a1, 1 as ignored
MATCH (o2:Object)-[:HAS_ATTRIBUTE]->(a2:Attribute)
USING INDEX o2:Object(name)
WHERE o2.name = o1.name
AND o1.id > o2.id
AND a1.name > a2.name
AND a1.concept_type = a2.concept_type
RETURN DISTINCT o1.name,a1.name,a2.name, a1.concept_type
LIMIT 1000
1 Like

I have filed a bug report here: https://github.com/neo4j/neo4j/issues/12296. Please let me know if you need additional info.

Hi,

This is pure wizardry! The statement you provided executed in 2 secs, whereas my original MATCH query has not completed in like 5 mins.

I am providing query plans for both:

Slow query:

MATCH (o1:Object)-[:HAS_ATTRIBUTE]->(a1:Attribute),
(o2:Object)-[:HAS_ATTRIBUTE]->(a2:Attribute)
WHERE o1.id > o2.id
AND o1.name = o2.name
AND a1.name > a2.name
AND a1.concept_type = a2.concept_type
AND a1.concept_type = 'color'
RETURN DISTINCT o1.name,a1.name,a2.name, a1.concept_type
LIMIT 1000

Fast Query:

MATCH (o1:Object)-[:HAS_ATTRIBUTE]->(a1:Attribute)
WHERE a1.concept_type = 'color'
WITH o1, a1, 1 as ignored
MATCH (o2:Object)-[:HAS_ATTRIBUTE]->(a2:Attribute)
USING INDEX o2:Object(name)
WHERE o2.name = o1.name
AND o1.id > o2.id
AND a1.name > a2.name
AND a1.concept_type = a2.concept_type
RETURN DISTINCT o1.name,a1.name,a2.name, a1.concept_type
LIMIT 1000

Why is "1 as ignored" necessary here?

For consecutive MATCH clauses, connected by simple WITH clauses, the planner has the option of reordering and choosing where to start and how to expand the resulting pattern(s). I believe the problem happening in this query is that we're providing a planner hint, but because the predicate for that hint relies upon a value from the previous match, we need that first match to execute first. But the hint itself is for finding a starting place in the graph. The planner should be able to figure out the right order of execution but it's tripping on something, thus it looks like buggy behavior.

When we introduce a new variable in a WITH clause, it forces the ordering. The first MATCH will happen first, then this new variable will be introduced in scope, then the second MATCH executed using the index hint. No chance of reordering, and it forces the index hint to only be used after the first match has executed, ensuring we have the value from o1.name needed to fulfill the hint.

1 Like

Thank you. That makes sense now. I used it in other "inner join" kinda queries too, and it made things much faster because index was being used.

Of course, this particular trick did require knowledge of how query planner works. Any chance you have a list of such tips and tricks?