Cartesian product when importing distance matrix

andrej.liebert257 · August 10, 2020, 5:48am

So, for my bachelor thesis I have to import a distance matrix in Neo4j and got the following message:

This query builds a cartesian product between disconnected patterns.

my data set is destributed into two files, one for nodes one for edges.
Nodes:

ID; Name
1;"AANAUSF000000AG0149"
2;"AANAUSF000000AG0150"
3;"AANAUSM000000000Z24"
4;"AANAUSM000000000Z28"
5;"AANAUSM000000AG0152"

Edges:

startID;endID;distance
1;2;17
1;3;21
1;4;23
1;5;30
2;3;24
2;4;28
2;5;35
3;4;18
3;5;27
4;5;17

starID and endID refere to the same kind of nodes and distance is the value how far apart those samples are from each other. The relationship file represents a (n-1) x (n-1) matrix in which symmetric pairings are not significant ( [a,b] == [b,a] ) as the diagonal , which is always 0 ( [a,a] == 0 ). The final matrix represents alsmot 60 million relations and many of them could have the same distance like [1,2] and [4,5].

I tried the following instruction, leading me to the inevitable cartesian warning:

LOAD CSV WITH HEADERS FROM "file:///Kanten.csv" AS row
FIELDTERMINATOR ';'
Match (c1:Cow {id:toInteger(row.startID)}), (c2:Cow{id:toInteger(row.endID)})
CREATE (c1)-[d:DISTANCE]->(c2);

here is also my plan

My question is now: Is there a work around for this problem or is this the end for my thesis?
I appreciate every advice I can take!

cobra · August 10, 2020, 6:18am

Hello @andrej.liebert257

LOAD CSV WITH HEADERS FROM "file:///Kanten.csv" AS row FIELDTERMINATOR ';'
MATCH (c1:Cow {id: toInteger(row.startID)})
MATCH (c2:Cow {id: toInteger(row.endID)})
CREATE (c1)-[d:DISTANCE]->(c2)

To make the load faster, did you create a UNIQUE CONSTRAINT on the id property?

Regards,
Cobra

andrej.liebert257 · August 11, 2020, 8:45pm

Hey @cobra,

It solved the problem! I implemented a few CONSTRAINTS:

(1) CREATE CONSTRAINT cow_constraint ON (c:Cow) ASSERT c.id IS UNIQUE
(2) CREATE CONSTRAINT cow_exists_constraint ON (c:Cow) ASSERT EXISTS (c.id)
(3) CREATE CONSTRAINT cow_name_exists_constraint ON (c:Cow) ASSERT EXISTS (c.name)
(4) CREATE CONSTRAINT node_key_constraint ON (c:Cow) ASSERT (c.id, c.name) IS NODE KEY
(5) CREATE CONSTRAINT distance_exists ON ()-[d:DISTANCE]-() ASSERT EXISTS (d.distance)

(1) to (4) worked just fine, but (5), creation of a relationship property existence constraint. resolved in the following message:

Relationship(0) with type DISTANCE must have the property distance

To be honest, I don't know if everyone was necessary. I wonder how those CONSTRAINTS resulted into a faster loading. In the manual I found this:

"When using MERGE or MATCH with LOAD CSV we need to make sure we have an index or a unique constraint on the property that we are merging on. This will ensure that the query executes in a performant way."

On 5.4 Constraints it sais:
(I) Adding a unique property constraint on a property will also add a single-property index on that property, so such an index cannot be added separately.
(II) Adding a node key constraint for a set of properties will also add a composite index on those properties, so such an index cannot be added separately.

So it has something to be with indexing. I found this quote on 5.2.1 Indexing Introduction
"This [Indexing] comes at the cost of additional storage space and slower writes, so deciding what to index and what not to index is an important and often non-trivial task."

Concidering that my project has to deal with a huge data volume, it would make sense to have only the required constraints. But which constraints are necessary and which don't? I also build an index
CREATE INDEX FOR (c:COW) ON (c.name) just in case I'll need it for performance. In the end, it might be redundant - because of (I)/(II) ... ? I will try to get into Indexes, but wouldn't decline a more understandable source than 5.2 to dig into the theme.

After the relationship constraint (4) was deleted, I used your code and it worked. Howerever,
I am still curious why

MATCH (c1:Cow {id:toInteger(row.startID)}), (c2:Cow{id:toInteger(row.endID)})

had to be separated into

MATCH (c1:Cow {id: toInteger(row.startID)})
MATCH (c2:Cow {id: toInteger(row.endID)})

cobra · August 12, 2020, 6:28am

You need one UNIQUE CONSTRAINT on a propery, usually, it's an ID.
You can also create multiple indexes that will be used in WHERE clauses, so if you always use a property in your WHERE clauses, you can add an index on it

Between the two codes, it should not have any difference but I prefer the second syntax, it's more clear

anthapu · August 12, 2020, 10:16am

Your dataset matches the format neo4j admin import format. If this is the first time import you can do this pretty fast with this option.

Please take a look at

https://neo4j.com/docs/operations-manual/current/tools/import/syntax/

Topic		Replies	Views
Cartesian product issue Desktop cypher	3	480	May 2, 2023
Avoid cartesian product when create relationships Cypher cypher	10	4731	July 18, 2024
“Query Optimization” : Neo4j Query builds a cartesian product between disconnected patterns - Cypher performance , cypher , relationship	3	1793	October 5, 2020
This tutorial should explain how to avoid Cartesian Products Feedback & Requests	2	1574	November 26, 2020
Neo4j browser exercises suggests queries that leads to "cartesian product" warning Feedback & Requests	5	2624	July 26, 2019

Cartesian product when importing distance matrix

Related topics