I'm trying to do some basic analysis on a CSV import before committing the content to my graph model. In short, there are two key properties that link to a name, and I expect that the same combination of those two key properties will link to the same name.
Property 1 (mo_hash) is never empty and is, architecturally, a grouping of property 2. Property 2 (licensee_gln) may be empty, in which case it's not subject to analysis, but if not empty, it should be unique within property 1. This uniqueness is verified by first ensuring that I'm not looking at the same node and then comparing against property 3 (company_name). I then return any instances where property 3 differs between two nodes with what should be the unique key.
To optimize this, I have a few indexes:
- Index 1 on property 1 (mo_hash), redundant given index 3.
- Index 2 on property 2 (licensee_gln), used to filter out records where property 2 is blank.
- Index 3 on property 1 and property 2 (mo_hash + licensee_gln).
The query starts by filtering on property 2, searching for non-blank records, which I expect to use index 2. It then matches on property 1 and property 2, which I expect to use index 3. I then apply additional filters to find the nodes of interest.
That's fine in theory, but in practice the match on property 1 and property 2 doesn't use the index, resulting in a query that will likely run for so long that we'll see the heat death of the universe before I get an answer. When I ask to "explain" the query, I see that the initial filter uses an index as expected, but the next match is a NodeByLabelScan.
Here's the query:
match (r1:grp_raw)
where r1.licensee_gln <> ""
match (r2:grp_raw { mo_hash: r1.mo_hash, licensee_gln: r1.licensee_gln })
where id(r2)<> id(r1) and r2.company_name <> r1.company_name
return r1.MO, r1.licensee_gln, r1.company_name as name1, r2.company_name as name2
The "explain" diagram is below. What am I missing?