cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! Site migration is underway. Phase 2: migrate recent content

Cartesian product when importing distance matrix

andrej_liebert2
Node Link

So, for my bachelor thesis I have to import a distance matrix in Neo4j and got the following message:

This query builds a cartesian product between disconnected patterns.

my data set is destributed into two files, one for nodes one for edges.
Nodes:

ID; Name
1;"AANAUSF000000AG0149"
2;"AANAUSF000000AG0150"
3;"AANAUSM000000000Z24"
4;"AANAUSM000000000Z28"
5;"AANAUSM000000AG0152"

Edges:

startID;endID;distance
1;2;17
1;3;21
1;4;23
1;5;30
2;3;24
2;4;28
2;5;35
3;4;18
3;5;27
4;5;17

starID and endID refere to the same kind of nodes and distance is the value how far apart those samples are from each other. The relationship file represents a (n-1) x (n-1) matrix in which symmetric pairings are not significant ( [a,b] == [b,a] ) as the diagonal , which is always 0 ( [a,a] == 0 ). The final matrix represents alsmot 60 million relations and many of them could have the same distance like [1,2] and [4,5].

I tried the following instruction, leading me to the inevitable cartesian warning:

LOAD CSV WITH HEADERS FROM "file:///Kanten.csv" AS row
FIELDTERMINATOR ';'
Match (c1:Cow {id:toInteger(row.startID)}), (c2:Cow{id:toInteger(row.endID)})
CREATE (c1)-[d:DISTANCE]->(c2);

here is also my plan

My question is now: Is there a work around for this problem or is this the end for my thesis?
I appreciate every advice I can take!

1 ACCEPTED SOLUTION

Cobra
Ninja
Ninja

Hello @andrej.liebert257

LOAD CSV WITH HEADERS FROM "file:///Kanten.csv" AS row FIELDTERMINATOR ';'
MATCH (c1:Cow {id: toInteger(row.startID)})
MATCH (c2:Cow {id: toInteger(row.endID)})
CREATE (c1)-[d:DISTANCE]->(c2)

To make the load faster, did you create a UNIQUE CONSTRAINT on the id property?

Regards,
Cobra

View solution in original post

4 REPLIES 4

Cobra
Ninja
Ninja

Hello @andrej.liebert257

LOAD CSV WITH HEADERS FROM "file:///Kanten.csv" AS row FIELDTERMINATOR ';'
MATCH (c1:Cow {id: toInteger(row.startID)})
MATCH (c2:Cow {id: toInteger(row.endID)})
CREATE (c1)-[d:DISTANCE]->(c2)

To make the load faster, did you create a UNIQUE CONSTRAINT on the id property?

Regards,
Cobra

andrej_liebert2
Node Link

Hey @Cobra,

It solved the problem! I implemented a few CONSTRAINTS:

(1) CREATE CONSTRAINT cow_constraint ON (c:Cow) ASSERT c.id IS UNIQUE
(2) CREATE CONSTRAINT cow_exists_constraint ON (c:Cow) ASSERT EXISTS (c.id)
(3) CREATE CONSTRAINT cow_name_exists_constraint ON (c:Cow) ASSERT EXISTS (c.name)
(4) CREATE CONSTRAINT node_key_constraint ON (c:Cow) ASSERT (c.id, c.name) IS NODE KEY
(5) CREATE CONSTRAINT distance_exists ON ()-[d:DISTANCE]-() ASSERT EXISTS (d.distance)

(1) to (4) worked just fine, but (5), creation of a relationship property existence constraint. resolved in the following message:

Relationship(0) with type DISTANCE must have the property distance

To be honest, I don't know if everyone was necessary. I wonder how those CONSTRAINTS resulted into a faster loading. In the manual I found this:

"When using MERGE or MATCH with LOAD CSV we need to make sure we have an index or a unique constraint on the property that we are merging on. This will ensure that the query executes in a performant way."

On 5.4 Constraints it sais:
(I) Adding a unique property constraint on a property will also add a single-property index on that property, so such an index cannot be added separately.
(II) Adding a node key constraint for a set of properties will also add a composite index on those properties, so such an index cannot be added separately.

So it has something to be with indexing. I found this quote on 5.2.1 Indexing Introduction
"This [Indexing] comes at the cost of additional storage space and slower writes, so deciding what to index and what not to index is an important and often non-trivial task."

Concidering that my project has to deal with a huge data volume, it would make sense to have only the required constraints. But which constraints are necessary and which don't? I also build an index
CREATE INDEX FOR (c:COW) ON (c.name) just in case I'll need it for performance. In the end, it might be redundant - because of (I)/(II) ... ? I will try to get into Indexes, but wouldn't decline a more understandable source than 5.2 to dig into the theme.

After the relationship constraint (4) was deleted, I used your code and it worked. Howerever,
I am still curious why

MATCH (c1:Cow {id:toInteger(row.startID)}), (c2:Cow{id:toInteger(row.endID)})

had to be separated into

MATCH (c1:Cow {id: toInteger(row.startID)})
MATCH (c2:Cow {id: toInteger(row.endID)})

Cobra
Ninja
Ninja

You need one UNIQUE CONSTRAINT on a propery, usually, it's an ID.
You can also create multiple indexes that will be used in WHERE clauses, so if you always use a property in your WHERE clauses, you can add an index on it

Between the two codes, it should not have any difference but I prefer the second syntax, it's more clear

anthapu
Graph Fellow

Your dataset matches the format neo4j admin import format. If this is the first time import you can do this pretty fast with this option.

Please take a look at

https://neo4j.com/docs/operations-manual/current/tools/import/syntax/