Yet another Cartesian Product problem

Cartesian products lie at the heart of pretty much every query I've been troubleshooting on our Neo4j servers over the last nine months; however, I can't find a generic solution to this one. The query is produced by query-builder code in response to user activity in the UI.

XX has been used to hide certain pieces of private data.

The following query uses multiple labels to avoid joins where the joined nodes aren't needed as part of the output.

PROFILE
MATCH (ph:PH_JP_XXXX:PH_OF_XXXX:PH_BU_XXXX:PH_EC_XXXX)-[:PH_OD]->(od:OD_PR_XXXX)
WHERE ph.PH_XXX_PhysicalDevices_XXXS.epochMillis >= {p0}
RETURN count(DISTINCT ph) AS Count_PH_Id

This query executes as follows:

Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 99032348 total db hits in 53971 ms


The most annoying thing about this problem is that if I simply reverse the direction in which the pattern is written (without changing the logic at all), the planner chooses the correct plan and avoids the Cartesian product. Like this:

PROFILE
MATCH (od:OD_PR_XXXX)<-[:PH_OD]-(ph:PH_JP_XXXX:PH_OF_XXXX:PH_BU_XXXX:PH_EC_XXXX)
WHERE ph.PH_XXX_PhysicalDevices_XXXS.epochMillis >= {p0}
RETURN count(DISTINCT ph) AS Count_PH_Id

This produces a nice, quick execution:

Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 66466 total db hits in 172 ms

That's ~313 times faster just because the planner stopped being stupid.


Unfortunately, because these queries are built dynamically, I can't tweak them manually. I have to find a generic way to tell the planner to avoid a Cartesian product. I have tried the USING JOIN ON hint, to no avail. I even resorted to executing part of the query, COLLECTing the results and then UNWINDing them to try to influence when the filters are applied, but (perhaps thankfully) this didn't work either - it seemed to use up all available memory.
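
For reference, the hinted variant I tried was roughly of this shape (a sketch from memory, not a verbatim copy of what the query builder emitted):

PROFILE
MATCH (ph:PH_JP_XXXX:PH_OF_XXXX:PH_BU_XXXX:PH_EC_XXXX)-[:PH_OD]->(od:OD_PR_XXXX)
USING JOIN ON od
WHERE ph.PH_XXX_PhysicalDevices_XXXS.epochMillis >= {p0}
RETURN count(DISTINCT ph) AS Count_PH_Id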

Is there any way to regain control of the query plan?

Which version of Neo4j is being used here (we know it's 3.5.x, but which patch release)? I agree that's a horrible plan; we'll want to figure out why it's being generated.

Do you have an index on one of ph's labels and the PH_XXX_PhysicalDevices_XXXS property? It looks like this is a temporal type of some kind, but the WHERE clause here, a predicate against its epochMillis, won't use an index; it needs to be changed so that you're comparing against another temporal value.

For example, if this is a dateTime type, then you could do something like:

WHERE ph.PH_XXX_PhysicalDevices_XXXS >= datetime({epochMillis: {p0}})
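
In the context of the original query, that would look something like the following (a sketch only, keeping the same masked identifiers and parameter):

PROFILE
MATCH (ph:PH_JP_XXXX:PH_OF_XXXX:PH_BU_XXXX:PH_EC_XXXX)-[:PH_OD]->(od:OD_PR_XXXX)
WHERE ph.PH_XXX_PhysicalDevices_XXXS >= datetime({epochMillis: {p0}})
RETURN count(DISTINCT ph) AS Count_PH_Id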

Version is 3.5.3 (enterprise)

If I take out the WHERE clause entirely, it exhibits the same behaviours as described above, only exacerbated by the increased size of the returned dataset. (But yes, the property is a dateTime type and we have indexes on it).
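
For reference, by "indexes on it" I mean ordinary single-property indexes on the ph labels, created with something along these lines (one label shown, names masked as above):

CREATE INDEX ON :PH_JP_XXXX(PH_XXX_PhysicalDevices_XXXS)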

Have you tried a newer patch version? 3.5.17 is the latest 3.5.x release; there may be planner fixes that address the issue. At the least, we could use it to rule out any already-fixed bugs.

We don't have a spare server within the protected production environment where this data is contractually obligated to be at rest. Patching is a non-trivial operation that we would need to schedule for next weekend (or perhaps the one after).

After days of setting up a server in the production environment to test this upgrade, I can confirm that the problem is exactly the same as before.