Too slow adding relationships to 1.8 billion nodes with Cypher

Hello,

I am creating a graph with 1.8 billion nodes, and several billion relationships.

I can create the nodes (which are called 'segments') very quickly with :

USING PERIODIC COMMIT 100000 LOAD CSV WITH HEADERS FROM 'file:///node_file' AS row
CREATE (s:Segment {segment_id: row.seg_id});

But once I have a graph with 1.8 billion nodes, adding relationships is far too slow - taking 44 hours just to add one batch of 5 million relationships, using this:

USING PERIODIC COMMIT 100000 LOAD CSV WITH HEADERS FROM 'file:///relations_file' AS row
MATCH (a:Segment),(b:Segment) WHERE a.segment_id = row.label_a AND b.segment_id = row.label_b
CREATE (a)-[rel:CONTAINS {pair_id: row.pair_id, iou: toFloat(row.iou)}]->(b);

It is this slow whether or not I add the constraint:

CREATE CONSTRAINT ON (s:Segment) ASSERT s.segment_id IS UNIQUE;

And whether or not I index the nodes:

CREATE INDEX ON :Segment(segment_id)

So, am I doing something wrong or sub-optimally, or does it just take this long to add each relationship to a graph with 1.8 billion nodes?

I am submitting my Cypher commands via py2neo, if that makes a difference.

Thanks for any advice you can offer!

Hmmm, it could be the MATCH statement that is slowing it down. Is there a way of doing it without the MATCH, in one Cypher statement, like MERGE (a:Type {key: 'value'}), (b:Type {key: 'value'}) MERGE (a)-[:TYPE]->(b) or something?

Then could you analyze your Cypher by putting PROFILE at the beginning of the query in the Neo4j Browser, so we can see what part is taking the longest and how many DB hits it is doing? That will show us which part to optimize.
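For example, something roughly like this (dropping USING PERIODIC COMMIT and adding a LIMIT so the profiled run finishes quickly - just a sketch, adjust the file name and limit to suit):

PROFILE LOAD CSV WITH HEADERS FROM 'file:///relations_file' AS row
WITH row LIMIT 1000
MATCH (a:Segment),(b:Segment) WHERE a.segment_id = row.label_a AND b.segment_id = row.label_b
CREATE (a)-[rel:CONTAINS {pair_id: row.pair_id, iou: toFloat(row.iou)}]->(b);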

And welcome to the community

Hi werner,

Thanks. I had assumed that the MATCH statement would be very fast at identifying nodes if they had been indexed, but this may just reflect my naivety about the Cypher language.

Just for clarity, each of my 1.8 billion nodes has the label 'Segment' and only one identifying property, 'segment_id'.

When I add my relationships with LOAD CSV, each row of the CSV has columns 'label_a' and 'label_b', which correspond to the 'segment_id' identifiers of the two nodes between which a relationship should be added.

Each row of the CSV also has a number called 'iou', which should be stored on the relationship.
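For illustration, the first few lines of the CSV look something like this (the values here are made up):

label_a,label_b,iou
10001,10002,0.73
10001,10007,0.41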

So, given this, and following your suggestion, the MATCH-free command I have come up with is:

USING PERIODIC COMMIT 100000 LOAD CSV WITH HEADERS FROM 'file:///relationships_file.csv' AS row

MERGE (a:Segment {segment_id: row.label_a}), (b:Segment {segment_id: row.label_b})
MERGE (a)-[:CONTAINS {iou: toFloat(row.iou)}]->(b);

Have I followed the correct syntax?

Thanks a lot!

Alex

44 hours for 5 million relationships is definitely not normal or expected.

I think the previous approach, creating the nodes first and then using a second query to match on the nodes and create the relationships, is the correct one.

Can you provide an EXPLAIN plan of the query (with all elements of the plan expanded, use the double-down arrow in the lower right of the results pane)?

You might also want to lower your batch size for the relationship creation operation, try anywhere from 20000 to 50000.

Also try running this via the Browser or cypher-shell, to take py2neo out of the equation for now.
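For example, something along these lines (just a sketch with a smaller batch size and the two MATCHes split up, so each lookup can use the index on segment_id - adjust the file name to yours):

USING PERIODIC COMMIT 20000 LOAD CSV WITH HEADERS FROM 'file:///relations_file' AS row
MATCH (a:Segment) WHERE a.segment_id = row.label_a
MATCH (b:Segment) WHERE b.segment_id = row.label_b
CREATE (a)-[rel:CONTAINS {pair_id: row.pair_id, iou: toFloat(row.iou)}]->(b);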

Hi Andrew Borman, I am not sure what an EXPLAIN query is, but switching to the browser was useful because it gave me a warning about this line:

MATCH (a:Segment),(b:Segment) WHERE a.segment_id = row.label_a AND b.segment_id = row.label_b

the warning is:

" This query builds a cartesian product between disconnected patterns. If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (b))"

Another warning is:

"Using LOAD CSV followed by a MATCH or MERGE that matches a non-indexed label will most likely not perform well on large data sets. Please consider using a schema index."

So, I guess that I thought I had an index but I didn't. Previously I had run:

CREATE CONSTRAINT ON (s:Segment) ASSERT s.segment_id IS UNIQUE;

I think this creates an index as part of creating the constraint?

I had also tried:

CREATE INDEX ON :Segment(segment_id)

But perhaps I need to additionally reference this index when using the MATCH?

Additionally, I have just checked which indexes are available in my database with 'CALL db.indexes', and it showed that I have one index with the following characteristics:

description: "INDEX ON :Segment(segment_id)"
indexName: "index_1"
tokenNames: ["Segment"]
properties: ["segment_id"]
state: "POPULATING"
type: "node_label_property"
progress: 20.758981704711914

So, is it possible that the index is still being created, and is not available to use in MATCH queries? Is that what POPULATING means?

Do I additionally need to reference the index specifically when submitting the MATCH query?

Thanks

Alex

POPULATING means that the index is still being created. You should see the status 'ONLINE' once everything is indexed and the index is ready to use.
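If you want something that blocks until the index is ready, a call like this should work (it waits for all indexes to come online, with a timeout in seconds - pick a timeout long enough for your data size):

CALL db.awaitIndexes(36000);

Otherwise you can just re-run CALL db.indexes periodically and check that the state field says 'ONLINE'.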

Hi Mark Needham,

I started again, and made the index in the neo4j browser, waiting until it said 'ONLINE'.

Then I attempted to upload approximately 5 million relations, using the following command (in the neo4j browser), explicitly referring to the index:

USING PERIODIC COMMIT 10000 LOAD CSV WITH HEADERS FROM 'file:///input_file' AS row
MATCH (a:Segment {segment_id : row.label_a}) USING INDEX a:Segment(segment_id)
MATCH (b:Segment {segment_id : row.label_b}) USING INDEX b:Segment(segment_id)
CREATE (a)-[rel:CONTAINS {iou: row.iou}]->(b);

However, 12 hours later, only 1 million of these relations have been created. So the performance is still too slow. Am I using the MATCH and INDEX correctly?

Hi, I ran the above code, but for just 200 relationships and with the PROFILE command, so that it would complete and tell us where it was too slow. Here is the output from the command:

That ValueHashJoin may be the culprit. It's an eager operator, and I don't think it belongs in a query like this.

If you do an EXPLAIN of the query without restricting it to 200, what does the query plan look like? Is that ValueHashJoin still there?

Thanks - I have done an EXPLAIN of the query without restricting to 200, and the ValueHashJoin is still there (see pic below). Can you recommend how I should change my query to avoid creating it? Thanks a lot for your help so far!

[screenshot: EXPLAIN plan, part 2]

Hi all, just wanted to flag this problem again. To recap, I am trying to add billions of relations to existing nodes (which have been indexed by their feature 'segment_id'), using the following lines:

USING PERIODIC COMMIT 10000 LOAD CSV WITH HEADERS FROM 'file:///input_file' AS row
MATCH (a:Segment {segment_id : row.label_a}) USING INDEX a:Segment(segment_id)
MATCH (b:Segment {segment_id : row.label_b}) USING INDEX b:Segment(segment_id)
CREATE (a)-[rel:CONTAINS {iou: row.iou}]->(b);

While the explicit use of an index has removed the warning about a cartesian product, the EXPLAIN plan of this approach (see pics above) shows that it uses the eager 'ValueHashJoin' operator. Currently each relationship takes about 30 ms to add, which means that adding 5 million relationships takes roughly 44 hours.

I am not expert enough in neo4j to know what ValueHashJoin is doing, but I understand it is related to the 'MATCH' command. If you could suggest an alternative format for this query that would avoid using this 'ValueHashJoin' component, that would be great.

Thanks in advance for any help that anyone can offer!