Too slow adding relationships to 1.8 billion nodes with Cypher

Hello,

I am creating a graph with 1.8 billion nodes, and several billion relationships.

I can create the nodes (which are called 'segments') very quickly with :

USING PERIODIC COMMIT 100000 LOAD CSV WITH HEADERS FROM 'file:///node_file' AS row
CREATE (s:Segment {segment_id: row.seg_id});

But once I have a graph with 1.8 billion nodes, adding relationships is far too slow - taking 44 hours just to add one batch of 5 million relationships, using this:

USING PERIODIC COMMIT 100000 LOAD CSV WITH HEADERS FROM 'file:///relations_file' AS row
MATCH (a:Segment),(b:Segment) WHERE a.segment_id = row.label_a AND b.segment_id = row.label_b
CREATE (a)-[rel:CONTAINS {pair_id: row.pair_id, iou: toFloat(row.iou)}]->(b);

It is this slow whether or not I add the constraint:

CREATE CONSTRAINT ON (s:Segment) ASSERT s.segment_id IS UNIQUE;

And whether or not I index the nodes:

CREATE INDEX ON :Segment(segment_id)

So, am I doing something wrong or sub-optimally, or does it just take this long to add each relationship to a graph with 1.8 billion nodes?

I am submitting my Cypher commands via py2neo, if that makes a difference.

Thanks for any advice you can offer!

Hmmm, it could be the MATCH statement that is slowing it down. Is there a way of doing it without the MATCH, in one Cypher statement, like MERGE (a:Type {key: 'value'}), (b:Type {key: 'value'}) MERGE (a)-[:TYPE]->(b) or something?

Then could you analyze your Cypher by putting PROFILE at the beginning of the query in the Neo4j Browser, so we can see what part is taking the longest and how many DB hits it is doing? That will show us which part to optimize.
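For example, something roughly like this (dropping USING PERIODIC COMMIT and adding a LIMIT so the profiled run finishes quickly - just a sketch, adjust the file name and limit to suit):

PROFILE LOAD CSV WITH HEADERS FROM 'file:///relations_file' AS row
WITH row LIMIT 1000
MATCH (a:Segment),(b:Segment) WHERE a.segment_id = row.label_a AND b.segment_id = row.label_b
CREATE (a)-[rel:CONTAINS {pair_id: row.pair_id, iou: toFloat(row.iou)}]->(b);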

And welcome to the community

Hi werner,

Thanks. I had assumed that the MATCH statement would be very fast at identifying nodes if they had been indexed, but this may just reflect my naivety about the Cypher language.

Just for clarity, each of my 1.8 billion nodes has the label 'Segment' and only one identifying property, 'segment_id'.

When I add my relationships with LOAD CSV, each row of the CSV has columns 'label_a' and 'label_b', which correspond to the 'segment_id' identifiers of the two nodes between which a relationship should be added.

Each row of the CSV also has a number called 'iou', which should be stored on the relationship.
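For illustration, the first few lines of the CSV look something like this (the values here are made up):

label_a,label_b,iou
10001,10002,0.73
10001,10007,0.41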

So, given this, and following your suggestion, the MATCH-free command I have come up with is:

USING PERIODIC COMMIT 100000 LOAD CSV WITH HEADERS FROM 'file:///relationships_file.csv' AS row

MERGE (a:Segment {segment_id: row.label_a}), (b:Segment {segment_id: row.label_b})
MERGE (a)-[:CONTAINS {iou: toFloat(row.iou)}]->(b);

Have I followed the correct syntax?

Thanks a lot!

Alex

44 hours for 5 million relationships is definitely not normal or expected.

I think the previous approach, creating the nodes first and then using a second query to match on the nodes and create the relationships, is the correct one.

Can you provide an EXPLAIN plan of the query (with all elements of the plan expanded, use the double-down arrow in the lower right of the results pane)?

You might also want to lower your batch size for the relationship creation operation, try anywhere from 20000 to 50000.

Also try running this via the Browser or cypher-shell, to take py2neo out of the equation for now.
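For example, something along these lines (just a sketch with a smaller batch size and the two MATCHes split up, so each lookup can use the index on segment_id - adjust the file name to yours):

USING PERIODIC COMMIT 20000 LOAD CSV WITH HEADERS FROM 'file:///relations_file' AS row
MATCH (a:Segment) WHERE a.segment_id = row.label_a
MATCH (b:Segment) WHERE b.segment_id = row.label_b
CREATE (a)-[rel:CONTAINS {pair_id: row.pair_id, iou: toFloat(row.iou)}]->(b);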

Hi Andrew Borman, I am not sure what an EXPLAIN query is, but switching to the browser was useful because it gave me a warning about this line:

MATCH (a:Segment),(b:Segment) WHERE a.segment_id = row.label_a AND b.segment_id = row.label_b

the warning is:

" This query builds a cartesian product between disconnected patterns. If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (b))"

Another warning is:

"Using LOAD CSV followed by a MATCH or MERGE that matches a non-indexed label will most likely not perform well on large data sets. Please consider using a schema index."

So, I guess that I thought I had an index but I didn't. Previously I had run:

CREATE CONSTRAINT ON (s:Segment) ASSERT s.segment_id IS UNIQUE;

I think this creates an index as part of creating the constraint?

I had also tried:

CREATE INDEX ON :Segment(segment_id)

But perhaps I need to additionally reference this index when using the MATCH?

Additionally, I have just checked which indexes are available in my database with 'CALL db.indexes', and it showed that I have one index with the following characteristics:

description: "INDEX ON :Segment(segment_id)"
indexName: "index_1"
tokenNames: ["Segment"]
properties: ["segment_id"]
state: "POPULATING"
type: "node_label_property"
progress: 20.758981704711914

So, is it possible that the index is still being created, and is not available to use in MATCH queries? Is that what POPULATING means?

Do I additionally need to reference the index specifically when submitting the MATCH query?

Thanks

Alex

POPULATING means that the index is still being created. You should see the status 'ONLINE' once everything is indexed and the index is ready to use.
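If you want something that blocks until the index is ready, a call like this should work (it waits for all indexes to come online, with a timeout in seconds - pick a timeout long enough for your data size):

CALL db.awaitIndexes(36000);

Otherwise you can just re-run CALL db.indexes periodically and check that the state field says 'ONLINE'.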

Hi Mark Needham,

I started again, and made the index in the neo4j browser, waiting until it said 'ONLINE'.

Then I attempted to upload approximately 5 million relations, using the following command (in the neo4j browser), explicitly referring to the index:

USING PERIODIC COMMIT 10000 LOAD CSV WITH HEADERS FROM 'file:///input_file' AS row
MATCH (a:Segment {segment_id : row.label_a}) USING INDEX a:Segment(segment_id)
MATCH (b:Segment {segment_id : row.label_b}) USING INDEX b:Segment(segment_id)
CREATE (a)-[rel:CONTAINS {iou: row.iou}]->(b);

However, 12 hours later, only 1 million of these relations have been created. So the performance is still too slow. Am I using the MATCH and INDEX correctly?

Hi, I ran the above code, but for just 200 relationships and with the PROFILE command, so that it would complete and tell us where it was too slow. Here is the output from the command:

That ValueHashJoin may be the culprit. It's an eager operator, and I don't think it belongs in a query like this.

If you do an EXPLAIN of the query without restricting it to 200, what does the query plan look like? Is that ValueHashJoin still there?

Thanks - I have done an EXPLAIN of the query without restricting to 200, and the ValueHashJoin is still there (see pic below). Can you recommend how I should change my query to avoid creating it? Thanks a lot for your help so far!

[screenshot: EXPLAIN plan, part 2]

Hi all, just wanted to flag this problem again. To recap, I am trying to add billions of relations to existing nodes (which have been indexed by their feature 'segment_id'), using the following lines:

USING PERIODIC COMMIT 10000 LOAD CSV WITH HEADERS FROM 'file:///input_file' AS row
MATCH (a:Segment {segment_id : row.label_a}) USING INDEX a:Segment(segment_id)
MATCH (b:Segment {segment_id : row.label_b}) USING INDEX b:Segment(segment_id)
CREATE (a)-[rel:CONTAINS {iou: row.iou}]->(b);

While the explicit use of an index has removed the warning about a cartesian product, the EXPLAIN plan of this approach (see pics above) shows that it uses the eager 'ValueHashJoin' operator. Currently each relationship takes about 30 ms to add, which means that adding 5 million relationships takes roughly 44 hours.

I am not expert enough in neo4j to know what ValueHashJoin is doing, but I understand it is related to the 'MATCH' command. If you could suggest an alternative format for this query that would avoid using this 'ValueHashJoin' component, that would be great.

Thanks in advance for any help that anyone can offer!