Inserting a Relationship Post Database Setup

Hi, I am new to Neo4j but have been searching and trying to resolve this for a week now with no success. I have a DB with the OffShore_Leaks data in it. I have imported the nodes of Bahamas_Leaks and am now trying to get the Bahamas relationships inserted.

I have filtered the data and created a relationships CSV with the following header:
node_1,rel_type,node_2,sourceID,valid_until,start_date,end_date
23000001,intermediary_of,20000035,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,
23000001,intermediary_of,20000033,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,
23000001,intermediary_of,20000041,Bahamas Leaks,The Bahamas Leaks data is current through early 2016.,,
....

And have checked that these IDs exist in the Intermediary and Entity Nodes.

I have written a number of Cypher queries to do the import, since the bulk importer requires nodes and seems to be intended mainly for instantiating the DB.

LOAD CSV WITH HEADERS FROM "http://IP_ADDRESS/bulk/import/intermediary_of.csv" AS row
WITH row WHERE row.rel_type = "intermediary_of"
MATCH (n1:Node) WHERE n1.node_id = row.node_1
MATCH (n2:Node) WHERE n2.node_id = row.node_2
CREATE (n1)-[:INTERMEDIARY_OF]->(n2);

Syntactically this looks to be correct, but when I run it via the Desktop I get "(no changes, no records)".
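As a minimal sanity check (just a sketch against the same file), I can return a few rows and confirm the ids actually come through as expected:

LOAD CSV WITH HEADERS FROM "http://IP_ADDRESS/bulk/import/intermediary_of.csv" AS row
// quick check that rows are read and the id columns look right
RETURN row.node_1, row.rel_type, row.node_2
LIMIT 5;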

Going in circles on this one.

Hi,

Looks like you imported bahamas_leaks_nodes only. Did you import bahamas_leaks_intermediary?

MATCH (n1:Node) WHERE n1.node_id = "23000001" is failing as this id does not exist in bahamas_leaks_nodes. This id exists in bahamas_leaks_intermediary.

Here is the schema that I used for offshore_leaks:

I have the correct schema and all the data loaded from offshore_leaks. I have the nodes from bahamas_leaks and can search and find them individually.

I have changed my Cypher and gone to the command line, hoping to get a more useful error message.

neo4j> LOAD CSV WITH HEADERS FROM "http://IP_ADDRESS/bulk/import/intermediary_of.csv" AS row
WITH row WHERE row.rel_type = "intermediary_of"
MATCH (n1:Node) WHERE n1.node_id = row.node_1
MATCH (n2:Node) WHERE n2.node_id = row.node_2
CREATE (n1)-[:INTERMEDIARY_OF]->(n2);

Connection to the database terminated. This can happen due to network instabilities, or due to restarts of the database

Every time I run this Cypher the database dies.

Note that I also changed the import to use file:/// with the same results.

Did you label both the bahamas_leaks nodes and the intermediary nodes as 'Node'?
Intermediary nodes should have a different label. In your query the label is the same in both MATCH statements.

Run this
MATCH (n1:Node) WHERE n1.node_id = "23000001" RETURN n1
and see if you get any result.

Yes, I have tried multiple queries over the week.
I have named the labels in line with the following Cypher.

neo4j> LOAD CSV WITH HEADERS FROM "file:///intermediary_of.csv" AS row
WITH row WHERE row.rel_type = "intermediary_of"
MATCH (n1:Intermediary) WHERE n1.node_id = row.node_1
MATCH (n2:Entity) WHERE n2.node_id = row.node_2
CREATE (n1)-[:INTERMEDIARY_OF]->(n2);

....but still get the following error.

Connection to the database terminated. This can happen due to network instabilities, or due to restarts of the database

The Cypher

MATCH (n1:Node) WHERE n1.node_id = 23000001 RETURN n1

returns
n1
{
"sourceID": "Bahamas Leaks",
"name": "Internal User",
"valid_until": "The Bahamas Leaks data is current through early 2016.",
"node_id": 23000001
}

If the node_id is stored as an integer then try this:

MATCH (n1:Node) WHERE n1.node_id = toInteger(row.node_1)
MATCH (n2:Node) WHERE n2.node_id = toInteger(row.node_2)

neo4j> MATCH (n1:Intermediary) WHERE n1.node_id = 23000001 RETURN n1;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
| n1 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
| (:Intermediary {sourceID: "Bahamas Leaks", name: "Internal User", valid_until: "The Bahamas Leaks data is current through early 2016.", node_id: 23000001}) |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+

1 row available after 103 ms, consumed after another 187 ms

neo4j> MATCH (n2:Entity) WHERE n2.node_id =20000035 RETURN n2;
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| n2 |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| (:Entity {sourceID: "Bahamas Leaks", name: "TINU HOLDINGS LIMITED", valid_until: "The Bahamas Leaks data is current through early 2016.", node_id: 20000035}) |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

1 row available after 98 ms, consumed after another 572 ms

neo4j> LOAD CSV WITH HEADERS FROM "file:///intermediary_of.csv" AS row
WITH row WHERE row.rel_type = "intermediary_of"
MATCH (n1:Intermediary) WHERE n1.node_id = row.node_1
MATCH (n2:Entity) WHERE n2.node_id = row.node_2
CREATE (n1)-[:INTERMEDIARY_OF]->(n2);

Connection to the database terminated. This can happen due to network instabilities, or due to restarts of the database

Caused the DB to exit.

neo4j> USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///intermediary_of.csv" AS row
WITH row WHERE row.rel_type = "intermediary_of"
MATCH (n1:Intermediary) WHERE n1.node_id = row.node_1
MATCH (n2:Entity) WHERE n2.node_id = row.node_2
CREATE (n1)-[:INTERMEDIARY_OF]->(n2);
0 rows available after 1207271 ms, consumed after another 2 ms

Although this did not cause the DB to exit this time, the session at the prompt was dead.

neo4j> USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///intermediary_of.csv" AS row
WITH row WHERE row.rel_type = "intermediary_of"
MATCH (n1:Intermediary) WHERE n1.node_id = toInteger(row.node_1)
MATCH (n2:Entity) WHERE n2.node_id = toInteger(row.node_2)
CREATE (n1)-[:INTERMEDIARY_OF]->(n2);
Connection to the database terminated. This can happen due to network instabilities, or due to restarts of the database

VERY Frustrating!

Please share the LOAD CSV code you used to create the Entity and Intermediary nodes. I will use it in my DB and check.

Hi, I used neo4j-admin import to initially set up the DB with the nodes and a set of edges.

I am now trying to use LOAD CSV to add some additional edges/relationships.

This is not straightforward and has been very unstable.

To load additional nodes I use

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'http://IP/bahamas_leaks.nodes.intermediary.csv' AS line
CREATE (:Intermediaries {name: line.name, internal_id: line.internal_id, address: line.address, valid_until: line.valid_until, country_codes: line.country_codes, countries: line.countries, status: line.status, node_id: toInt(line.node_id), sourceID: line.sourceID})

It's all importing correctly and I can get data out correctly... the schema is in as understood.

Very strange.

I have the Enterprise Version on an AWS cluster.

OK, it looks like I get a partial import of the edges/relationships now when I slow down the processing via the periodic commit.

Maybe there is a bad character in the CSV input stream.

Good to hear that. All is well that ends well!

Not quite solved yet. The output from Neo4j on the state of the import gives few clues about what the issue is. I looked at the input stream and can't find an issue. Very painful!

Looked at the data and it is clean, no special characters.

The import stops at exactly 6500 entries.
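To rule out a bad row, one thing I can still try (a sketch, assuming the stopping point roughly corresponds to a row offset in the file) is to page through the CSV around that point:

LOAD CSV WITH HEADERS FROM "file:///intermediary_of.csv" AS row
// offsets 6490/20 are illustrative, centred on where the import stops
WITH row SKIP 6490 LIMIT 20
RETURN row;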

Now I am wondering if I am hitting a Neo4j limitation.

This looks to be a heap-size limitation with Neo4j.

Will look to do the following.

Break up import sizes into multiple imports.

Increase the Java and Neo4j heap size.

Increase the periodic commit even further.

It looks like Neo4j tries to do everything in memory before committing... that will always be a limiting factor in any system architecture, especially when dealing with big data... they should probably look at doing some of this via virtual memory.
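For reference, the heap settings live in neo4j.conf; the values below are only an example and would need to be sized to the instance:

# neo4j.conf - example values only, size to the instance
dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g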

I will next work out how to use apoc.periodic.iterate to see if that helps.
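Something along these lines, I think (a sketch, assuming APOC is installed and using the same file, labels and batch size as above):

CALL apoc.periodic.iterate(
  "LOAD CSV WITH HEADERS FROM 'file:///intermediary_of.csv' AS row
   WITH row WHERE row.rel_type = 'intermediary_of'
   RETURN row",
  "MATCH (n1:Intermediary) WHERE n1.node_id = toInteger(row.node_1)
   MATCH (n2:Entity) WHERE n2.node_id = toInteger(row.node_2)
   CREATE (n1)-[:INTERMEDIARY_OF]->(n2)",
  {batchSize: 500, parallel: false});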

There's no mention of indexes or constraints here. If you were running into heap issues when using periodic commit CSV loading with MATCHes, more than likely you don't have indexes on the label/property used for lookup, meaning for each row you're doing an entire label scan, which would explain the heap pressure.

Please use an EXPLAIN on your load query, and if you see NodeByLabelScan it means you aren't using index lookups, and should create indexes (or unique constraints) to make your matches quick and ease up heap pressure.
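For example, something like this (Neo4j 3.x syntax; adjust the labels/properties to match whatever your MATCHes actually use):

CREATE INDEX ON :Intermediary(node_id);
CREATE INDEX ON :Entity(node_id);

Unique constraints on node_id would also work if the ids are unique per label, since they create the backing index as well. Prefixing the LOAD CSV query with EXPLAIN should then show an index seek (NodeIndexSeek) instead of NodeByLabelScan.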

Hi Andrew,

OK understood.

I did start to look at the indexing over the weekend as I also thought that may be the issue.

I will get this in place and feedback as required.

Thank you for the feedback.

All the Best
Mike

Hi Andrew,

Thank God!

It works and was fast!

Now I can move forward!

Thank you so much for that feedback.

All the Best
Mike