Index performance - NodeByLabelScan on indexed columns

I'm trying to do some basic analysis on a CSV import before committing the content to my graph model. In short, there are two key properties that link to a name, and I expect that the same combination of those two key properties will link to the same name.

Property 1 (mo_hash) is never empty and is, architecturally, a grouping of property 2. Property 2 (licensee_gln) may be empty, in which case it's not subject to analysis, but if not empty, it should be unique within property 1. This uniqueness is verified by first ensuring that I'm not looking at the same node and then comparing against property 3 (company_name). I then return any instances where property 3 differs between two nodes with what should be the unique key.

To optimize this, I have a few indexes:

  1. Index 1 on property 1 (mo_hash), redundant given index 3.
  2. Index 2 on property 2 (licensee_gln), used to filter out records where property 2 is blank.
  3. Index 3 on property 1 and property 2 (mo_hash + licensee_gln).

The query starts by filtering on property 2, searching for non-blank records, which I expect to use index 2. It then matches on property 1 and property 2, which I expect to use index 3. I then apply additional filters to find the nodes of interest.

That's fine in theory, but in practice the match on property 1 and property 2 doesn't use the index, resulting in a query that will likely run for so long that we'll see the heat death of the universe before I get an answer. When I ask to "explain" the query, I see that the initial filter uses an index as expected, but the next match is a NodeByLabelScan.

Here's the query:

match (r1:grp_raw)
where r1.licensee_gln <> ""
match (r2:grp_raw { mo_hash: r1.mo_hash, licensee_gln: r1.licensee_gln })
where id(r2)<> id(r1) and r2.company_name <> r1.company_name
return r1.MO, r1.licensee_gln, r1.company_name as name1, r2.company_name as name2

The "explain" diagram is below. What am I missing?

The second MATCH statement is not using index_3. Solve with the USING keyword to hint the correct index to the query planner.

Note: Make sure you're correctly defining index_3 as a COMPOSITE INDEX. (Check with CALL db.indexes();

Define index_3 as a composite index

CREATE INDEX grp_composite FOR (r:grp_raw) ON ( r.mo_hash, r.licensee_gln );

HINT the composite index with USING

MATCH (r1:grp_raw)
WHERE r1.licensee_gln <> ""
MATCH (r2:grp_raw)
  USING INDEX r2:grp_raw(mo_hash, licensee_gln)
  WHERE
    r2.mo_hash = r1.mo_hash
    AND r2.licensee_gln = r1.licensee_gln
    AND r2.company_name <> r1.company_name
    // don't use the internal id(r1) unless you MUST.
    // WHERE id(r1) <> id(r2)
    //    is functionally equivalent to 
    // WHERE r1 <> r2
    //    which you already get more cheaply with the company_name check.
return r1.MO, r1.licensee_gln, r1.company_name as name1, r2.company_name as name2

See also:

Thanks for responding, and apologies for the delay, as I didn't get an email notification that there was a reply waiting.

First, thanks for the hint re: removing the node equality comparison. It's one of those blindingly obvious things after someone points it out. ;-)

Here's the relevant entry after "call db.indexes()":

27 "grp_raw_mo_licensee" "ONLINE" 100.0 "NONUNIQUE" "BTREE" "NODE" ["grp_raw"] ["mo_hash", "licensee_gln"] "native-btree-1.0"

Here's my modified query:

explain
match (r1:grp_raw)
where r1.licensee_gln <> ""
match (r2:grp_raw)
using index r2:grp_raw(mo_hash, licensee_gln)
where r2.mo_hash = r1.mo_hash
and r2.licensee_gln = r1.licensee_gln
and r2.company_name <> r1.company_name
return r1.MO, r1.licensee_gln, r1.company_name as name1, r2.company_name as name2

Here's the output:

Failed to fulfil the hints of the query.
Could not solve these hints: `USING INDEX r2:grp_raw(mo_hash, licensee_gln)`

Plan ProduceResult(List(r1.MO, r1.licensee_gln, name1, name2)) {
  LHS -> Projection(Map(r1.MO -> Property(Variable(r1),PropertyKeyName(MO)), r1.licensee_gln -> Property(Variable(r1),PropertyKeyName(licensee_gln)), name1 -> Property(Variable(r1),PropertyKeyName(company_name)), name2 -> Property(Variable(r2),PropertyKeyName(company_name)))) {
    LHS -> Selection(Ands(Set(Equals(Property(Variable(r2),PropertyKeyName(mo_hash)),Property(Variable(r1),PropertyKeyName(mo_hash))), Not(Equals(Property(Variable(r2),PropertyKeyName(company_name)),Property(Variable(r1),PropertyKeyName(company_name))))))) {
      LHS -> Apply() {
        LHS -> NodeByLabelScan(r2, LabelName(grp_raw), Set()) {}
        RHS -> Selection(Ands(Set(Not(In(Property(Variable(r1),PropertyKeyName(licensee_gln)),ListLiteral(List(Parameter(  AUTOSTRING0,String)))))))) {
          LHS -> NodeIndexSeek(r1, LabelToken(grp_raw,LabelId(13)), List(IndexedProperty(PropertyKeyToken(licensee_gln,PropertyKeyId(9048)),CanGetValue)), SingleQueryExpression(Property(Variable(r2),PropertyKeyName(licensee_gln))), Set(r2), IndexOrderNone) {}
        }
      }
    }
  }
}

What am I missing?

From your original post, although the variables seem switched between the query you provided and the EXPLAIN plan, the plan looks expected.

Your query will involve a NodeByLabelScan where the licensee_gln property <> "". There's no other way to plan that.

The other :grp_raw MATCH was correctly using an index seek based upon r1.

(note that the order in which I described is the actual order for the plan: the label provides values for which the index seek is performed).

So your query and your plan are actually correct here, you would need to change your criteria for the initial lookup if you want to get away from that label scan.

For example, if you made this change, removing all licensee_gln properties where the value was the empty string:

MATCH (n:grp_raw)
WHERE r1.licensee_gln = ""
REMOVE r1.licensee_gln

Then you could leverage the index for:

MATCH (n:grp_raw)
WHERE exists(r1.licensee_gln)
...

This is because only nodes with a value for the indexed property make it into the index. Nodes that do not have that property will not be in the index. So when you assert that the property must exist, then all values in the the index for :grp_raw(licensee_gln) will be returned.

That said, if you make that change, you will no longer be able to quickly lookup nodes for which the value doesn't exist.

MATCH (n:grp_raw)
WHERE NOT exists(r1.licensee_gln)
...

This would become a label scan operation, which would be less efficient than with your current approach:

MATCH (n:grp_raw)
WHERE r1.licensee_gln = ""
...

This one is currently an index seek, since an empty string is still a value, thus it is indexed.

1 Like

There's an index on licensee_gln as well. However, I have removed it as recommended and rewritten the first part of the query to use exists(). Despite that, I still get the same error.

Failed to fulfil the hints of the query.
Could not solve these hints: `USING INDEX r2:grp_raw(mo_hash, licensee_gln)`

You shouldn't need the index hint, drop the USING INDEX line, though we'll want to double-check the plan afterward.

Removing the "using" hint now gives me something much more efficient than I had before. What confuses me, though, is that there's no indication anywhere in this tree that the grp_raw(:mo_hash, :licensee_gln) index is being used; only the grp_raw(:licensee_gln) index is showing up. Neo4j doesn't appear to recognize that index at all, even when I give it a hint.