Loading edges for Paradise Papers leads to cartesian product warning and slow creation of relationships

Hi there,

I have a Neo4j question: I'm trying to load the Panama Papers, and the data is already prepared. The nodes load fine, but the edges are giving me some trouble.

Here is the data: https://offshoreleaks.icij.org/pages/database and here are the first few lines:

"START_ID","TYPE","END_ID","link","start_date","end_date","sourceID","valid_until"
"85004927","registered_address","88000379","registered address","","","Paradise Papers - Aruba corporate registry","Aruba corporate registry data is current through 2016"
"85004928","registered_address","88016409","registered address","","","Paradise Papers - Aruba corporate registry","Aruba corporate registry data is current through 2016"
"85004929","registered_address","88005855","registered address","","","Paradise Papers - Aruba corporate registry","Aruba corporate registry data is current through 2016"

This is the Cypher query:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///csv_panama_papers.2018-02-14/panama_papers.edges.csv' AS row
MATCH (startnode {node_id:row.START_ID}), (endnode {node_id:row.END_ID})
CALL apoc.create.relationship(startnode, row.TYPE, {start_date:row.start_date, end_date:row.end_date, sourceID:row.sourceID, valid_until:row.valid_until}, endnode) YIELD rel
RETURN rel

When running this query I get a warning:

This query builds a cartesian product between disconnected patterns.

If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (endnode))

MATCH (startnode {node_id:row.START_ID}), (endnode {node_id:row.END_ID} )

I can see where this goes wrong, but I haven't found a solution yet, and I could use some help keeping this fast. Can you help me with this one?

I looked at other similar topics here first, but as a newbie they didn't solve my problem, which first of all is my lack of knowledge. Thanks anyway.

Jos

When the intent is to match two specific nodes in order to create a relationship between them, there is no way around a cartesian product between those nodes (they are not yet connected). So the general approach is correct and you can dismiss the warning; it is not the source of the slowdown.

The problem is the MATCH on your start and end nodes. You're not using labels in the pattern, so it performs an all-nodes scan for each node, and that happens for every row of your CSV.

You should have labels present on both of these nodes, and you should have a supporting index on the label you are matching on together with the node_id property, so those lookups become quick index seeks.
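
As a rough sketch (the label names here are just placeholders; substitute whichever labels your node import actually created):

// supporting indexes so each lookup becomes an index seek instead of an all-nodes scan
CREATE INDEX ON :Entity(node_id)
CREATE INDEX ON :Address(node_id)

// the MATCH in your load query would then look something like this
MATCH (startnode:Entity {node_id:row.START_ID}), (endnode:Address {node_id:row.END_ID})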

Hi Andrew, thanks for clarifying this. I have a hard time understanding where to put these labels. Below is a screenshot of the current labels and properties.

I have set indexes via these commands:
CREATE INDEX ON :ENTITY(node_id)
CREATE INDEX ON :ADDRESS(node_id)
CREATE INDEX ON :INTERMEDIARIES(node_id)
CREATE INDEX ON :OFFICER(node_id)
CREATE INDEX ON :OTHER(node_id)

And the edges file contains multiple different relationship types, hence the APOC procedure.
How would I need to adjust the query?

Solved it by adding a general label to all nodes and using that for indexing. Much faster.
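
For reference, this is roughly what I ran beforehand (a sketch; GENERAL is just the label name I picked, and on a very large graph you may want to batch the SET, for example with apoc.periodic.iterate):

// add a common label to every node
MATCH (n) SET n:GENERAL

// index the shared label on node_id
CREATE INDEX ON :GENERAL(node_id)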

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///csv_offshore_leaks.2018-02-14/offshore_leaks.edges.csv' AS row
MATCH (startnode:GENERAL {node_id:row.START_ID}), (endnode:GENERAL {node_id:row.END_ID} )
CALL apoc.create.relationship(startnode, row.TYPE, {start_date:row.start_date, end_date:row.end_date, sourceID:row.sourceID, valid_until:row.valid_until}, endnode) YIELD rel
RETURN rel
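
As a quick sanity check after the load (not part of the import, just counting what was created):

MATCH ()-[r]->()
RETURN type(r), count(*) AS total
ORDER BY total DESC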